Cascade Learning by Optimally Partitioning
Abstract: The cascaded AdaBoost classifier is a well-known efficient object detection algorithm. The cascade structure has many parameters to be determined. Most existing cascade learning algorithms are designed by assigning a detection rate and a false positive rate to each stage, either dynamically or statically. Their objective functions are not directly related to minimum computation cost, so these algorithms are not guaranteed to find the optimal solution in the sense of minimizing computation cost. On the assumption that a strong classifier is given, in this paper we propose an optimal cascade learning algorithm (we call it iCascade) which iteratively partitions the strong classifier into two parts until a predefined number of stages is generated. iCascade searches for the optimal number $r_i$ of weak classifiers of each stage $i$ by directly minimizing the computation cost of the cascade. Theorems are provided to guarantee the existence of a unique optimal solution. Theorems are also given for the proposed efficient algorithm for searching the optimal parameters $r_i$. Once a new stage is added, the parameter $r_i$ of each stage decreases gradually as the iteration proceeds, which we call the decreasing phenomenon. Moreover, with the goal of minimizing computation cost, we develop an effective algorithm for setting the optimal threshold of each stage classifier. In addition, we prove in theory why more new weak classifiers are required in the current stage than in the previous stage. Experimental results on face detection demonstrate the effectiveness and efficiency of the proposed algorithm.

Index Terms: AdaBoost, cascade learning, classifier design, object detection.


I. Introduction 

ROBUST and real-time object detection is a key problem in computer vision tasks such as vision-based Human Computer Interaction (HCI), video surveillance, and biometrics. The robustness of an object detection system is mainly governed by the robustness of the extracted features and the generalization ability of the employed classifiers. The detection efficiency is determined by the types of features, the manner in which the features are extracted, and the structure of the classifiers. For example, it is well known that features can be computed with the integral image trick, which is suitable for efficient object detection. However, the structure of the classifiers is also important for efficient object detection. For example, AdaBoost classifiers with cascade structure have greatly contributed to real-time face detection, human detection, etc. With a cascade structure, a large fraction of sub-windows can be rejected at early stages with a small number of weak classifiers. Only the sub-windows of true positives and those similar to true positives can arrive at later stages. However, how to design an optimal cascade structure is an open problem, which is the focus of this paper.

Cascade learning is the process of determining the parameters of a cascade in order to improve the efficiency of an AdaBoost classifier. The cascade parameters mainly include the number of stages, the number of weak classifiers in each stage, and the threshold of each stage. However, most existing cascade learning methods are not directly formulated as a constrained optimization problem. Though more efficient than the non-cascade classifier, they are not guaranteed to be the best in the sense of maximizing detection efficiency under acceptable constraints. Usually, there are many hand-crafted parameters which are chosen according to one's intuition and experience. The performance of the cascade AdaBoost then relies on one's insight into the cascade structure. As Saberian and Vasconcelos mentioned, the design of a good cascade can take up several weeks. In addition, some useful intuitions are not justified in theory.

To overcome the above problems, we formulate cascade learning as a process of learning the parameters of a cascade by minimizing the computation cost under certain constraints.

In summary, the contributions and characteristics of the 
paper are as follows. 

1) We transform the strong classifier of regular AdaBoost into an optimal cascade classifier. That is, the result of regular AdaBoost is the input of our cascade learning algorithm. In the sense of detection rate and rejection rate, we use cascade AdaBoost to approach its non-cascade counterpart (i.e., regular AdaBoost) with minimum computation cost.

2) The objective function of our method is just the computation cost of a cascade. In contrast, most of the existing algorithms are designed by empirically assigning a detection rate and a false positive rate to each stage, either dynamically or statically. Existence and uniqueness of the optimal solution are analytically proved.

3) To design a one-stage cascade structure, we propose to partition the strong classifier $H(x)$, a combination of weak classifiers $h_1, \ldots, h_T$, into a left part $H_L(x, r_1)$ and a right part $H_R(x, r_1)$ at the partition point $r_1$ (see Algorithm 1 and Fig. 1). The optimal partition point $r_1$ is found by minimizing the objective function $f_1(r)$, which stands for the computation cost of the cascade classifier. We theoretically prove (Theorem 1) that $f_1(r)$ has a unique minimum solution. Moreover, we give a theorem (Theorem 2) that provides a rough estimate of the optimal solution.

4) To design a two-stage cascade structure, we propose to further partition the right classifier $H_R(x, r_1)$ into two parts at the partition point $r_2$. The partition continues iteratively (see Fig. 4). This algorithm is not globally optimal if $r_1$ is fixed while $r_2$ is considered as a variable. To obtain a globally optimal solution, we jointly model the computation cost $f(r_1, r_2)$ with both $r_1$ and $r_2$ as variables. We prove that $f(r_1, r_2)$ has a unique minimum solution (see Theorem 7). An iterative optimization algorithm (Algorithm 2) is proposed to find the optimal solution. Theoretical analysis (Theorems 9-12) shows that $r_1$ decreases in each iteration where $r_2$ is fixed and $r_2$ decreases in each iteration where $r_1$ is fixed. We call this the decreasing phenomenon. Such a globally optimal two-stage cascade learning algorithm can be easily generalized to a multi-stage one (Algorithm 3).

5) Moreover, we contribute to learning the optimal threshold $t_i$ of each stage classifier for minimizing the computation cost $f_S$ of the cascaded classifier. We prove that the computation cost decreases with the stage threshold $t_i$ (Theorem 13). Based on this theorem, we develop an effective threshold learning algorithm (Algorithm 4) whose core is properly decreasing $t_i$. Though this algorithm is not globally optimal, it is very effective. We call the proposed algorithm (i.e., Algorithm 4 and the procedure in Fig. 10) iCascade.

6) We prove in theory why more new weak classifiers are required in the current stage than in the previous stage (Theorem 5). In addition, we also theoretically prove why cascade AdaBoost is more efficient than its non-cascade counterpart. Though these results and phenomena can be intuitively understood, we are the first to theoretically justify them, to the best of our knowledge.

II. Related Work 

This section briefly reviews some existing work related to cascade learning.

Most existing cascade learning algorithms can be called DF-guided methods (where "DF" stands for Detection rate and False positive rate), an approach pioneered by Viola and Jones [8]. In the learning step, a DF-guided method selects weak classifiers step by step until the predefined minimum acceptable detection rate and maximum acceptable false positive rate are both satisfied. We call this method VJCascade.

Variants of VJCascade have been proposed to select and organize weak classifiers. BoostChain [17] improves VJCascade by reusing the ensemble score from previous stages to enhance the current stage. Brubaker et al. called such a technique BoostChain recycling. Similar to BoostChain, SoftCascade also allows for monotonic accumulation of information as the classifier is evaluated. In Multi-exit AdaBoost [22], node classifiers also share overlapping sets of weak classifiers. FloatBoost, like BoostChain, uses the DF-guided strategy to design the cascade. But unlike VJCascade, FloatBoost uses a backtracking mechanism to eliminate the less useful or even detrimental weak classifiers. Wu et al. [20] employed the Forward Feature Selection (FFS) algorithm to greedily select features. Wang et al. developed an asymmetric learning algorithm for both feature selection and ensemble classifier learning. FisherBoost uses the column generation technique to implement a totally-corrective boosting algorithm. To decrease the training burden caused by the large number of negative samples and over-complete features (e.g., Haar-like features), some algorithms use only a random subset of the feature pool [10], [16].

Effort has also been devoted to adjusting the thresholds of the stages of a cascade structure, which are also called the thresholds of node classifiers. On the assumption that a full cascade has been trained by the VJCascade algorithm, Luo [19] proposed to jointly optimize the setting of the thresholding parameters of all the node classifiers within the cascade. The WaldBoost algorithm utilizes an adaptation of Wald's sequential probability ratio test to set stage thresholds [24]. Brubaker et al. proposed a linear program algorithm to select the weak classifiers and the threshold of a node classifier.

Though most existing methods are DF-guided, computation-cost guided (i.e., CC-guided) methods have also been developed. Chen and Yuille [23] gave a criterion for designing a time-efficient cascade that explicitly takes into account the time complexity of tests, including the time for pre-processing. They designed a greedy algorithm to minimize the criterion. But each stage in this method is constrained to detect all positive examples, which causes it to miss opportunities to improve detection efficiency. The loss function of the Cronus cascade learning algorithm is a tradeoff between accuracy (training error) and computation cost. CSTC (i.e., Cost-Sensitive Tree of Classifiers) combines regularized training error and computation cost into a loss function. Compared to VJCascade-like methods, CSTC is suitable for balanced classes and specialized features.

In contrast to the above methods, the objective function (i.e., loss function) of our method is just the computation cost, and the detection accuracy is naturally guaranteed. In addition, a global solution instead of a local one can be obtained by our method.

III. Proposed Method: One-Stage Cascade 

The goal of cascade learning is to learn a cascade structure in order to correctly reject negative sub-windows and accept positive sub-windows as fast as possible. Generally, the cascade structure is determined by the number of stages and the number of weak classifiers in each stage.

Most existing methods design or learn the cascade structure by assigning a minimum acceptable detection rate and a maximum acceptable false positive rate to each stage. In this paper, we propose a novel cascade learning method in which it is not necessary to assign such acceptable detection rates and false positive rates. Instead, we learn the parameters of a cascade by directly minimizing the computation cost.

In this section, we describe the proposed one-stage cascade 
learning algorithm which is the foundation of our multi-stage 
cascade learning algorithm. 

A. Testing Stage 

In our method, cascade AdaBoost is considered as an approximation of regular AdaBoost. A good cascade structure can achieve the same detection accuracy as AdaBoost at a small computation cost. Therefore, we begin by describing the form of the strong classifier of regular AdaBoost.

Let $H(x)$ be the strong classifier obtained by an AdaBoost algorithm. The strong classifier $H(x)$ is composed of $T$ weak classifiers $h_i(x) \in \{1, -1\}$ with weights $\alpha_i$:

$$H(x) = \sum_{i=1}^{T} \alpha_i h_i(x). \tag{1}$$

Generally, the weights of the weak classifiers satisfy

$$\alpha_1 > \alpha_2 > \cdots > \alpha_T > 0, \tag{2}$$

and

$$\sum_{i=1}^{T} \alpha_i = 1. \tag{3}$$

Let $l(x) \in \{1, -1\}$ be the class label of a feature vector $x$. The decision rule of the strong classifier $H(x)$ is:

$$l(x) = \begin{cases} 1, & \text{if } H(x) = \sum_{i=1}^{T} \alpha_i h_i(x) > t, \\ -1, & \text{if } H(x) = \sum_{i=1}^{T} \alpha_i h_i(x) \le t, \end{cases} \tag{4}$$

where $t$ is a threshold balancing the detection rate and the false positive rate.

In a one-stage cascade structure, there is only one stage in which a small number (i.e., $r$) of weak classifiers are combined for classification. The core of the proposed one-stage cascade is to determine an optimal $r$ which divides the strong classifier $H(x)$ into a left part $H_L(x)$ and a right part $H_R(x)$:

$$H(x) = H_L(x) + H_R(x), \tag{5}$$

$$H_L(x) = \sum_{i=1}^{r} \alpha_i h_i(x), \tag{6}$$

$$H_R(x) = \sum_{i=r+1}^{T} \alpha_i h_i(x). \tag{7}$$

To reject true negative sub-windows with less computation cost, we propose to use the maximum of $H_R(x)$ to approximate the value of $H_R(x)$:

$$\max H_R(x) = \max\left(\sum_{i=r+1}^{T} \alpha_i h_i(x)\right) \le \sum_{i=r+1}^{T} \alpha_i. \tag{8}$$

We denote the maximum by $\max H_R(x)$. With $\max H_R(x)$, it is guaranteed that all the true negative sub-windows can be correctly rejected if the following inequality holds:

$$H_L(x, r) + \max H_R(x, r) \le t. \tag{9}$$

That is, some sub-windows can be rejected by using merely $H_L(x)$ and $\max H_R(x)$ instead of both $H_L(x)$ and $H_R(x)$. Consequently, the computation cost is significantly reduced.

The remaining sub-windows, which do not satisfy (9), have to be classified using both $H_L(x)$ and $H_R(x)$ (i.e., the strong classifier). If the sum of $H_L(x)$ and $H_R(x)$ is not larger than $t$, i.e.,

$$H_L(x, r) + H_R(x, r) = H(x) \le t, \tag{10}$$



Fig. 1. Proposed method: one stage cascade AdaBoost with a given r. 


Algorithm 1 One-stage cascade

1: if $H_L(x, r) + \max H_R(x, r) \le t$
2:     then $l(x) = -1$,
3: else (i.e., $H_L(x, r) + \max H_R(x, r) > t$)
4:     if $H_L(x, r) + H_R(x, r) \le t$
5:         $l(x) = -1$,
6:     else (i.e., $H_L(x, r) + H_R(x, r) > t$)
7:         $l(x) = 1$.

then the sub-window corresponding to the feature vector $x$ is finally classified as a negative sub-window. Otherwise (i.e., the sum is larger than $t$), it is classified as a positive sub-window. The algorithm of the one-stage cascade is given in Algorithm 1. Equivalently, the flow-chart is shown in Fig. 1. Note that $t - \max H_R(x, r)$ can be viewed as the threshold for $H_L(x, r)$. The issue of how to set the threshold is addressed in Section V.C.
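The decision procedure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `one_stage_cascade`, the representation of weak classifiers as Python callables, and the argument `max_hr` (a precomputed value of $\max H_R(x, r)$) are assumptions made for the example.

```python
from typing import Callable, Sequence


def one_stage_cascade(
    x,
    weak: Sequence[Callable[[object], int]],
    alpha: Sequence[float],
    r: int,
    max_hr: float,
    t: float,
) -> int:
    """Classify x with the two-step test of Algorithm 1.

    weak[i](x) returns +1 or -1 and alpha[i] is its weight (hypothetical layout).
    max_hr is the precomputed maximum of the right part H_R(x, r).
    """
    # Left part H_L(x, r): only the first r weak classifiers are evaluated.
    h_left = sum(a * h(x) for a, h in zip(alpha[:r], weak[:r]))
    # Early rejection (9): even the largest right part cannot lift the score above t.
    if h_left + max_hr <= t:
        return -1
    # Otherwise evaluate the remaining T - r weak classifiers, as in (10).
    h_right = sum(a * h(x) for a, h in zip(alpha[r:], weak[r:]))
    return 1 if h_left + h_right > t else -1
```

In this sketch a sub-window rejected by the first test costs only `r` weak-classifier evaluations, which is the source of the savings discussed above.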


B. Training Stage: How to Select an Optimal r 

In Fig. 1, it is assumed that $r$ and $\max H_R(x, r)$ are given. In this sub-section, we describe how to choose an optimal $r$. $\max H_R(x, r)$ can be easily computed from training samples once $r$ is given. In the training stage of cascade learning, it is assumed that the strong classifier $H(x) = \sum_{i=1}^{T} \alpha_i h_i(x)$ has been obtained by a regular AdaBoost algorithm.

Given $r$, a fraction $p$ of true negative sub-windows can be rejected by using the left classifier $H_L(x)$ (i.e., (9)). The fraction $p$ is called the rejection rate and is defined by:

$$p(r) = \frac{\sum_{x} I\left(\sum_{i=1}^{r} \alpha_i h_i(x) + \max H_R(x, r) \le t\right)}{\sum_{x} I\left(l(x) = -1\right)}, \tag{11}$$

where $I(\text{condition})$ is 1 if the condition is satisfied and 0 otherwise, and $\sum_{x} I(l(x) = -1)$ is the number of all true negative sub-windows. Eq. (11) shows that $p$ is dependent on $r$.
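As an illustration of (11), the rejection rate can be estimated from the weak-classifier outputs on a training set. The sketch below is only one possible reading: the function name `rejection_rate`, the list-of-lists layout `outputs[n][i]`, and the choice to take $\max H_R(x, r)$ over the training samples are assumptions for the example.

```python
def rejection_rate(outputs, labels, alpha, r, t):
    """Estimate p(r) following (11).

    outputs[n][i] is h_i(x_n) in {+1, -1}; labels[n] is l(x_n) in {+1, -1}.
    """
    def h_left(row):
        return sum(a * h for a, h in zip(alpha[:r], row[:r]))

    def h_right(row):
        return sum(a * h for a, h in zip(alpha[r:], row[r:]))

    # max H_R(x, r), taken here over the training samples (an assumption).
    max_hr = max(h_right(row) for row in outputs)
    # Numerator of (11): sub-windows rejected by the early test (9).
    rejected = sum(1 for row in outputs if h_left(row) + max_hr <= t)
    # Denominator of (11): number of true negative sub-windows.
    negatives = sum(1 for y in labels if y == -1)
    return rejected / negatives
```

Evaluating this function for $r = 1, \ldots, T$ yields the empirical curve $p(r)$ whose shape is discussed next.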

Obviously, the fraction of true negative sub-windows classified by using both the left and right classifiers is $1 - p$. The criterion for choosing $r$ is to minimize the overall computation cost $f_1$, consisting of the cost $f_1^L$ of computing $H_L(x, r)$ in (9) and the cost $f_1^R$ of computing both $H_L(x, r)$ and $H_R(x, r)$ in (10).

Suppose that all the weak classifiers have the same computation complexity. Then the computation cost is determined by the number of weak classifiers. A fact is that $f_1^L$ grows with $p$ and $r$:

$$f_1^L(r, p) = p(r + c), \tag{12}$$

and $f_1^R$ grows with $1 - p$ and $T$:

$$f_1^R(r, p) = (1 - p)(T + 2c). \tag{13}$$

Fig. 2. A representative form of the function $p(r)$, which can be simplified as a combination of two linear functions: $p_1(r) = ar$ with $r < r^*$ and $p_2(r) = 1$ with $r \ge r^*$.

In (12) and (13), $c$ is the computational cost of checking whether inequality (9) or inequality (10) holds. Usually, the computation cost $C$ of a weak classifier is bigger than $c$. Let $C = 1$; then $c < 1$. Note that $c$ is not involved in computing $H_L(x, r)$ and $H_R(x, r)$.

The goal is to minimize the following objective function:

$$f_1(r, p) = f_1^L(r, p) + f_1^R(r, p) = p(r + c) + (1 - p)(T + 2c). \tag{14}$$
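To make the objective (14) concrete, the following sketch evaluates the cost for every candidate $r$ and picks the minimizer by exhaustive search. The names `one_stage_cost` and `best_partition`, and the representation of the rejection rate as a list indexed by $r$, are assumptions for this illustration; the efficient search discussed later avoids scanning the whole range.

```python
def one_stage_cost(p_r: float, r: int, T: int, c: float) -> float:
    """Objective (14): p (r + c) + (1 - p) (T + 2c), with weak-classifier cost C = 1."""
    return p_r * (r + c) + (1.0 - p_r) * (T + 2.0 * c)


def best_partition(p, T: int, c: float) -> int:
    """Return the r in 1..T minimizing (14); p[r] is the rejection rate at r."""
    return min(range(1, T + 1), key=lambda r: one_stage_cost(p[r], r, T, c))
```

For instance, with a hypothetical curve `p[r] = min(1, r / 4)`, `T = 10`, and `c = 0.1`, the cost falls until `r = 4` and grows afterwards.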

To solve this optimization problem, it is necessary to reveal the relationship between $r$ and $p$. The parameters $r$ and $p$ are correlated, and the correlation can be expressed as a function $p(r, \max H_R(x, r))$.

As $\max(H_R(x, r))$ (its upper bound is $\sum_{i=r+1}^{T} \alpha_i$) decreases with $r$, a larger number of negative sub-windows will be rejected by (9). It follows that the fraction $p$ of negative sub-windows satisfying (9) grows with $r$. Experimental results also show that $p$ monotonically increases with $r$. The relationship between $p$ and $r$ is nonlinear. Fig. 2 illustrates a typical trend of how $p$ varies with $r$. It can be seen that $p$ grows quickly from 0 to a value (e.g., 0.99) close to 1 when $r$ changes from 1 to a small value $r^*$ (e.g., 10). But $p$ becomes stable when $r$ is larger than $r^*$. The reason is that the first $r^*$ weak classifiers $h_i$, which have larger weights $\alpha_i$, play a much more important role than the remaining weak classifiers.

Mathematically, $r^*$ is defined as the minimum $r$ which satisfies $p \approx 1$, or equivalently $1 - p(r) < \epsilon$ with $\epsilon$ being a small number (e.g., 0.01):

$$r^* = \arg\min_{r} \{ r \mid 1 - p(r) < \epsilon \}. \tag{15}$$

We call $r^*$ the saturation point of $p(r)$.

Though $p(r)$ is in fact a high-order curve, it can be well modelled by combining two linear functions: $p_1(r) = ar$ with $r < r^*$ and $p_2(r) = 1$ with $r \ge r^*$ (see Fig. 2). As $T$ is a large number, $r^* \ll T$.

It is reasonably assumed that the function $p(r)$ satisfies the following conditions:

$$p(r_1) < p(r_2), \quad \text{if } r_1 < r_2, \tag{16}$$

$$p'(r_1) > p'(r_2) > 0, \quad \text{if } r_1 < r_2, \tag{17}$$

$$p(T) = 1, \tag{18}$$

$$p(0) = 0, \tag{19}$$

$$p'(T) = 0, \tag{20}$$

$$p'(0) \gg 0, \tag{21}$$

$$p'(1) \gg 0. \tag{22}$$

(16) states the monotonicity of $p(r)$. (17) says that the slope of $p(r)$ decreases with $r$. (20) shows that the slope is zero at $r = T$, while (22) shows that it is extremely large at $r = 1$. Note that (16)-(22) will be used as assumptions of the theorems of the proposed methods.

According to Fig. 2, $p(r)$ has the following properties:

$$r^* \ll T, \quad \text{as } T \text{ is a large number}, \tag{23}$$

$$p(r) \approx p_1(r) = ar, \quad \text{if } r < r^*, \tag{24}$$

$$p(r) \approx p_2(r) = 1, \quad \text{if } r \ge r^*, \tag{25}$$

which will be used as assumptions of Theorem 2.

Once each pair $(r, p)$ is known, the value of $f_1$ can be obtained. Theorem 1 states that there exists a unique minimum solution.

Theorem 1. $f_1(r) = p(r)(r + c) + (1 - p(r))(T + 2c)$ has a unique minimum solution $r_1$. Moreover, $f_1(r)$ monotonically decreases with $r$ until $r = r_1$ and then increases with $r$.

Proof. We first prove the existence of the minimum solution and then give the evidence of the uniqueness of the minimum solution.

Existence:

$$\because f_1(r) = p(r)(r + c) + (1 - p(r))(T + 2c),$$

$$\therefore f_1'(r) = p'(r)(r - T - c) + p(r).$$

Consider the value of the derivative $f_1'(r)$ when $r$ approaches 0:

$$\lim_{r \to 0} f_1'(r) = f_1'(0) = p'(0)(0 - c - T) + p(0). \tag{26}$$

Because $p(0) = 0$ (i.e., (19)) and $p'(0) \gg 0$ (i.e., (21)), it holds that:

$$\lim_{r \to 0} f_1'(r) = f_1'(0) = -p'(0)(T + c) < 0. \tag{27}$$

Now consider the value of the derivative $f_1'(r)$ when $r$ approaches $T$:

$$\lim_{r \to T} f_1'(r) = f_1'(T) = p(T) - p'(T)c \approx 1 - 0 > 0. \tag{28}$$

Because $\lim_{r \to 0} f_1'(r) < 0$, $\lim_{r \to T} f_1'(r) > 0$, and $f_1'(r)$ is a continuous function, there must exist an $r_1 \in [1, T]$ such that $f_1'(r_1) = 0$. This $r_1$ is at least a local minimum, which is the global minimum if the local minimum is unique.

Uniqueness (Proof by contradiction):

Suppose that there are two local minima $r_1$ and $r_2$ with $r_1 < r_2$. Then it holds that $f_1'(r_1) - f_1'(r_2) = 0$.

Now investigate the value of $f_1'(r_1) - f_1'(r_2) = [p(r_1) - p(r_2)] - [p'(r_1)(T + c - r_1) - p'(r_2)(T + c - r_2)]$ if $r_1 < r_2$ is true.

$$\because r_1 < r_2, \quad p'(r_1) > p'(r_2) > 0, \quad T + c - r_1 > T + c - r_2 > 0, \quad p(r_1) < p(r_2),$$

$$\therefore f_1'(r_1) - f_1'(r_2) < 0.$$

This contradicts $f_1'(r_1) - f_1'(r_2) = 0$. Therefore, $r_1 < r_2$ is wrong.

Similarly, we can prove that $r_1 > r_2$ is wrong. Consequently, $r_1 = r_2$ is true, meaning a unique solution. □

Fig. 3. A representative form of $f_1(r)$.

Fig. 3 shows a representative form of $f_1(r)$; it has a unique minimum solution.

Theorem 2. Let $r^*$ be the saturation point of $p(r)$ (see (15)) and assume that $p(r)$ can be modelled by combining $p_1(r) = ar$ where $r < r^*$ with $p_2(r) = 1$ where $r \ge r^*$ (see Fig. 2 for illustration). Then the saturation point $r^*$ is the optimal minimum solution $r_1 = \arg\min_{r} f_1(r)$.

Proof. Note that $f_1^L(r) = p(r)(r + c)$, $f_1^R(r) = (1 - p(r))(T + 2c)$, and $f_1(r) = f_1^L(r, p) + f_1^R(r, p) = p(r)(r + c) + (1 - p(r))(T + 2c)$.

Case 1: For $r^* \le r \le T$, because $p_2(r) = 1$ and $1 - p_2(r) = 0$, we have $f_1^L(r) = p_2(r)(r + c) = r + c$, $f_1^R(r) = 0$, and hence $f_1(r) = r + c$. Therefore, the optimal solution $r_R^*$ for $r \ge r^*$ is $r^*$ itself. That is,

$$r_R^* = \arg\min_{r^* \le r \le T} f_1(r) = r^*. \tag{29}$$

Case 2: For $0 < r < r^*$, because $p(r) \approx p_1(r) = ar$, we have:

$$f_1^L(r) = p_1(r)(r + c) = ar(r + c),$$

$$f_1^R(r) = (1 - p(r))(T + 2c) = (1 - ar)(T + 2c),$$

$$f_1(r) = f_1^L(r, p) + f_1^R(r, p) = ar^2 - a(T + c)r + (T + 2c),$$

$$f_1' = 2ar - a(T + c) = 0 \;\Rightarrow\; \hat{r} = \arg\min_{r} f_1(r) = (T + c)/2.$$

Because $r^* < \hat{r}$, $f_1(0) = T + 2c$, and $f_1(r)$ monotonically decreases with $r$ when $r < \hat{r}$, the minimum solution $r_L^*$ of $f_1(r)$ in the range $0 < r \le r^*$ is $r^*$. That is,

$$r_L^* = \arg\min_{0 < r \le r^*} f_1(r) = r^*. \tag{30}$$

It is observed from (29) and (30) that the minimum solutions for $0 < r \le r^*$ and $r^* \le r \le T$ are both identical to $r^*$. Consequently, $r^* = \arg\min_{r} f_1(r)$.

Therefore, the optimal minimum solution $r_1 = \arg\min_{r} f_1(r)$ is $r^*$. □
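Theorem 2 can be checked numerically under its own modelling assumption. The sketch below builds a hypothetical piecewise-linear rejection curve ($p_1(r) = r / r^*$ up to $r^*$, then 1), computes the saturation point from (15), and compares it with the brute-force minimizer of (14). The function names and the values `T = 200`, `c = 0.1`, `r_star = 10` are illustrative choices, not values from the experiments.

```python
def saturation_point(p, eps=0.01):
    """Smallest r with 1 - p(r) < eps, as in (15); p maps r to the rejection rate."""
    return min(r for r in sorted(p) if 1.0 - p[r] < eps)


def argmin_cost(p, T, c):
    """Brute-force minimizer of the one-stage objective (14)."""
    def cost(r):
        return p[r] * (r + c) + (1.0 - p[r]) * (T + 2.0 * c)
    return min(sorted(p), key=cost)


T, c, r_star = 200, 0.1, 10
# Hypothetical piecewise-linear model of Fig. 2 with a = 1 / r_star.
p = {r: min(1.0, r / r_star) for r in range(1, T + 1)}
```

With these values, both `saturation_point(p)` and `argmin_cost(p, T, c)` return 10, in line with the statement that the saturation point is the minimizer when $r^* \ll T$.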


IV. Proposed Method: Local-Minimum Based 
Multi-Stage Cascade 

In this section, we extend one-stage cascade learning to 
multi-stage cascade learning. 

A. Testing Stage 

From (5), one can see that the one-stage cascade is obtained by splitting $H(x)$ into $H_L(x, r_1)$ and $H_R(x, r_1)$, where $r_1$ is the optimal $r$ (i.e., $r_1 = \arg\min_{r} f_1(r)$). We add a superscript "1" to $H_L$ and $H_R$ so that one explicitly knows that $H_L^1(x, r_1)$ and $H_R^1(x, r_1)$ correspond to stage 1. The multi-stage cascade is obtained by iteratively splitting the right classifier $H_R$.

As shown in Fig. 4, the sub-windows not rejected by stage 1 are fed to stage 2. The second stage is obtained by further dividing the right classifier $H_R^1(x, r_1)$ into two parts at the partition point $r_2$ ($r_2 > r_1$):

$$H_R^1(x, r_1) = H_L^2(x, r_2) + H_R^2(x, r_2), \tag{31}$$

$$H_L^2(x, r_2) = \sum_{i=r_1+1}^{r_2} \alpha_i h_i(x), \tag{32}$$

$$H_R^2(x, r_2) = \sum_{i=r_2+1}^{T} \alpha_i h_i(x). \tag{33}$$

In stage 2, the sub-windows are rejected if the following inequality holds:

$$H_L^1(x, r_1) + \left[ H_L^2(x, r_2) + \max(H_R^2(x, r_2)) \right] \le t. \tag{34}$$

(34) is equivalent to

$$H_L(x, r_2) + \max(H_R(x, r_2)) \le t, \tag{35}$$

because

$$H_L(x, r_2) = H_L^1(x, r_1) + H_L^2(x, r_2). \tag{36}$$

But (35) is more time-consuming than (34), because $r_2$ ($r_2 > r_1$) weak classifiers are used to compute $H_L(x, r_2)$ in (35), whereas in (34) $H_L^1(x, r_1)$ has already been computed in stage 1 and $H_L^2(x, r_2)$ can be efficiently computed using as few as $(r_2 - r_1)$ weak classifiers. That is, $H_L^1(x, r_1)$ can be reused in stage 2.

Analogously, the right classifier in stage $i - 1$ can be represented by the left and right classifiers in stage $i$:

$$H_R^{i-1}(x, r_{i-1}) = H_L^{i}(x, r_i) + H_R^{i}(x, r_i). \tag{37}$$
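The reuse of the left sum across stages, as in (34) and (36), can be sketched as a running score. The code below is an illustrative sketch only: `cascade_classify`, the list `stage_points` holding $r_1, \ldots, r_S$, and the list `max_hr` holding the precomputed maxima of the right parts are assumed names, and the function also returns the number of weak classifiers evaluated so the savings can be inspected.

```python
def cascade_classify(x, weak, alpha, stage_points, max_hr, t):
    """Multi-stage test with a reused running left sum.

    stage_points = [r_1, ..., r_S]; max_hr[k] is the maximum of the right part
    for the partition point stage_points[k]. Returns (label, evaluations).
    """
    T = len(weak)
    score = 0.0  # running left sum H_L(x, r_i), carried over between stages
    used = 0     # number of weak classifiers evaluated so far
    for r_i, m in zip(stage_points, max_hr):
        # Only the new weak classifiers r_{i-1}+1 .. r_i are evaluated here.
        for k in range(used, r_i):
            score += alpha[k] * weak[k](x)
        used = r_i
        if score + m <= t:  # stage rejection test, cf. (34)
            return -1, used
    # Not rejected by any stage: finish the full strong classifier.
    for k in range(used, T):
        score += alpha[k] * weak[k](x)
    return (1 if score > t else -1), T
```

Each weak classifier is evaluated at most once per sub-window, so a sub-window rejected at stage $i$ costs $r_i$ evaluations rather than $r_1 + \cdots + r_i$.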

The block diagram of the multi-stage cascade is shown in Fig. 4, where the rejection rate $p_i$ is the ratio of sub-windows rejected in stage $i$. In stage 1, a fraction $p_1$ of the sub-windows is directly rejected and a fraction $1 - p_1$ is fed to stage 2. Among the $(1 - p_1)w$ sub-windows, a fraction $p_2$ is rejected by stage 2 and a fraction $1 - p_2$ is considered as positive-class candidates and is therefore fed to stage 3. This means that a fraction $(1 - p_1)(1 - p_2)$ of the total $w$ sub-windows is to be classified by stage 3. Because $p_i$ in stage $i$ is dependent on $p_{i-1}$ in stage $i - 1$, we explicitly express $p_i$ as $p(r_i | r_{i-1})$ when necessary.

Fig. 4. Testing process of local-minimum based multi-stage cascade AdaBoost.

Specifically, the rejection rate $p(r_i | r_{i-1})$ is defined as:

$$p(r_i | r_{i-1}) = \frac{\sum_{x} I\left(\sum_{k=1}^{r_i} \alpha_k h_k(x) + \max(H_R^{i}(x, r_i)) \le t\right)}{\sum_{x} I\left(\sum_{k=1}^{r_{i-1}} \alpha_k h_k(x) + \max(H_R^{i-1}(x, r_{i-1})) > t\right)}. \tag{38}$$

Fig. 5. The form of $p_i(r_i | r_{i-1})$ and its properties. If $\tilde{r}_{i-1} < r_{i-1}$, then $p_i(r_i | \tilde{r}_{i-1}) > p_i(r_i | r_{i-1})$ and $p_i'(r_i | \tilde{r}_{i-1}) < p_i'(r_i | r_{i-1})$.

Fig. 5 shows two representative curves of $p(r_i | r_{i-1})$. The properties of $p(r_i | r_{i-1})$ are summarized as follows:

$$p'(r_i | r_{i-1}) > 0, \tag{39}$$

$$\lim_{r_i \to r_{i-1}} p(r_i | r_{i-1}) = 0, \tag{40}$$

$$\lim_{r_i \to T} p(r_i | r_{i-1}) = 1, \tag{41}$$

$$p(r_i | \tilde{r}_{i-1}) > p(r_i | r_{i-1}), \quad \text{if } \tilde{r}_{i-1} < r_{i-1}, \tag{42}$$

$$p'(r_i | \tilde{r}_{i-1}) < p'(r_i | r_{i-1}), \quad \text{if } \tilde{r}_{i-1} < r_{i-1}. \tag{43}$$

We give a theoretical guarantee (Theorem 3) that adding a stage results in a reduction in computation cost if a certain condition is satisfied.

Theorem 3. Let $r_1, \ldots, r_S$ define an $S$-stage cascade structure whose computation cost is $f_S(r_1, \ldots, r_S)$:

$$f_S(r_1, \ldots, r_S) = \sum_{i=1}^{S} f_i^L(r_1, \ldots, r_i) + f_S^R(r_S), \tag{44}$$

where

$$f_i^L = \left[ \prod_{j=1}^{i-1} \left(1 - p_j(r_j | r_{j-1})\right) \right] p_i(r_i | r_{i-1})(r_i + ic), \tag{45}$$

$$f_S^R = \left[ \prod_{j=1}^{S} \left(1 - p_j(r_j | r_{j-1})\right) \right] (T + (S + 1)c). \tag{46}$$

Let $r_1, \ldots, r_{S-1}$ define an $(S-1)$-stage cascade structure whose computation cost is $f_{S-1}(r_1, \ldots, r_{S-1})$:

$$f_{S-1}(r_1, \ldots, r_{S-1}) = \sum_{i=1}^{S-1} f_i^L(r_1, \ldots, r_i) + f_{S-1}^R(r_{S-1}). \tag{47}$$

If $p_S(r_S | r_{S-1}) > c / (T + c - r_S)$, then we have

$$f_{S-1}(r_1, \ldots, r_{S-1}) > f_S(r_1, \ldots, r_{S-1}, r_S). \tag{48}$$

Proof.

$$\because f_{S-1}(r_1, \ldots, r_{S-1}) - f_S(r_1, \ldots, r_{S-1}, r_S) = f_{S-1}^R(r_{S-1}) - f_S^L(r_1, \ldots, r_S) - f_S^R(r_S) = \left[ \prod_{i=1}^{S-1} (1 - p_i) \right] \left[ p_S(r_S | r_{S-1})(T + c - r_S) - c \right],$$

$$\because 1 - p_j > 0, \quad p_S > 0,$$

$$\therefore \text{if } p_S(r_S | r_{S-1}) > c / (T + c - r_S), \text{ then } f_{S-1} > f_S.$$

□

Note that if the computational cost $c$ is omitted, then $f_{S-1} > f_S$ as long as $p_S(r_S | r_{S-1}) > 0$. In this case, it is optimal that each stage contains one new weak classifier (i.e., the case $S = T$, $r_1 = 1, r_2 = 2, \ldots, r_T = T$). But because $c \ne 0$ in practice, it is necessary to let $S < T$ and to find a way to search for the optimal values of $r_1, \ldots, r_S$.
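The cost model (44)-(46) and the condition of Theorem 3 can be explored with a short sketch. The function name `cascade_cost` and the 0-based list layout are assumptions for illustration, and the numbers in the usage note are hypothetical.

```python
def cascade_cost(p, r, T, c):
    """Computation cost (44) of an S-stage cascade.

    p[i] holds p_{i+1}(r_{i+1} | r_i) and r[i] holds r_{i+1} (0-based lists).
    """
    total = 0.0
    survive = 1.0  # prod_{j < i} (1 - p_j): fraction reaching stage i
    for i, (p_i, r_i) in enumerate(zip(p, r), start=1):
        total += survive * p_i * (r_i + i * c)  # f_i^L, eq. (45)
        survive *= 1.0 - p_i
    S = len(r)
    return total + survive * (T + (S + 1) * c)  # f_S^R, eq. (46)
```

With `T = 100` and `c = 0.5`, the one-stage cascade `p = [0.6], r = [5]` costs 43.7. Adding a second stage at `r_2 = 20` with `p_2 = 0.5` lowers the cost to 27.8, since 0.5 exceeds the bound $c / (T + c - r_2) = 0.5 / 80.5$. With `p_2 = 0.005`, which is below that bound, the two-stage cost is higher than the one-stage cost.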


B. Training Stage: How to Select Optimal $r_i$

Section IV.A describes the testing stage of the proposed 
cascade method. Now we describe the training stage of the 
proposed method which is closely related to Section III.B. 

1) Existence and Uniqueness: Investigating Fig. 4, one can find that the cascade structure is completely determined once $r_1, \ldots, r_S$ are known. Therefore, the main task of the training stage is to find the optimal $r_1, \ldots, r_S$.

The $r_1$ in stage 1 is obtained by the method in Section III.B. Given $r_1$, we learn the best $r_2$:

$$r_2 = \arg\min_{r} f_2(r_1, r) = \arg\min_{r} f_2(r | r_1). \tag{49}$$

Similar to the proof of Theorem 1, it can be proved that $f_2(r | r_1)$ has a unique minimum solution.

Generally, $r_i$ is computed based on $r_1, \ldots, r_{i-1}$:

$$r_i = \arg\min_{r_{i-1} < r < T} f_i(r_1, \ldots, r_{i-1}, r) = \arg\min_{r_{i-1} < r < T} f_i(r | r_1, \ldots, r_{i-1}). \tag{50}$$
In (50), $f_i(r | r_1, \ldots, r_{i-1})$ is used to express the assumption that $r_1, \ldots, r_{i-1}$ in the first $i - 1$ stages are given. If $r_1, \ldots, r_{i-1}$ and $p(r_1), p(r_2 | r_1), \ldots, p(r_{i-1} | r_{i-2})$ are known, then $f_i(r | r_1, \ldots, r_{i-1})$ has a form similar to $f_1(r)$ (see (14)):

$$\begin{aligned} f_i(r | r_1, \ldots, r_{i-1}) &= \sum_{j=1}^{i} f_j^L(r_1, \ldots, r_j) + f_i^R(r_i) \\ &= \sum_{j=1}^{i-1} f_j^L(r_1, \ldots, r_j) + f_i^L(r_1, \ldots, r_i) + f_i^R(r_i) \\ &= \sum_{j=1}^{i-1} f_j^L(r_1, \ldots, r_j) + \left[ \prod_{j=1}^{i-1} \left(1 - p_j(r_j | r_{j-1})\right) \right] p_i(r_i | r_{i-1})(r_i + ic) \\ &\quad + \left[ \prod_{j=1}^{i-1} \left(1 - p_j(r_j | r_{j-1})\right) \right] \left(1 - p_i(r_i | r_{i-1})\right)(T + (i + 1)c). \end{aligned} \tag{51}$$

Because the items in brackets in (51) are constant, $f_i(r | r_1, \ldots, r_{i-1})$ has a form similar to $f_1(r_1)$. Therefore, as a corollary of Theorem 1, we have the following theorem:

Theorem 4. $\min_{r} f_i(r_1, \ldots, r_{i-1}, r) = \min_{r} f_i(r | r_1, \ldots, r_{i-1})$ has a unique minimum solution $r_i \in [r_{i-1}, T]$. Moreover, $f_i(r | r_1, \ldots, r_{i-1})$ monotonically decreases with $r$ until $r = r_i$ and then increases with $r$.

Theorem 4 implies that $f_i(r | r_1, \ldots, r_{i-1})$ has a form similar to the curve in Fig. 3.

2) Efficient Search: The search range of $r_i$ is $(r_{i-1}, T)$. However, because $f_i(r | r_1, \ldots, r_{i-1})$ monotonically decreases with $r$ until $r = r_i$ and then increases with $r$, to find the unique minimum solution one can increase $r$ from $r_{i-1}$ with a small step and stop as soon as $f_i(r | r_1, \ldots, r_{i-1})$ no longer decreases. Therefore, the practical search range is smaller than $(r_{i-1}, T)$.
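The stop-at-first-increase search described above can be sketched as follows. The function name `search_minimum` and the convention that `cost` is a callable of the candidate $r$ are assumptions for illustration; the early stop is valid only when the cost is unimodal on the range, which is what Theorem 4 asserts for $f_i$.

```python
def search_minimum(cost, r_low, r_high):
    """Scan r = r_low + 1, r_low + 2, ... and stop at the first non-decrease.

    Assumes cost(r) decreases and then increases on (r_low, r_high].
    """
    best_r, best_cost = r_low + 1, cost(r_low + 1)
    for r in range(r_low + 2, r_high + 1):
        value = cost(r)
        if value >= best_cost:
            break  # the cost no longer decreases: the minimum has been passed
        best_r, best_cost = r, value
    return best_r
```

For a toy unimodal cost such as `(r - 7) ** 2 + 3` searched from `r_low = 2`, the scan evaluates only `r = 3, ..., 8` before stopping at the minimizer 7, instead of scanning the whole range up to `r_high`.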

The search range can be further reduced according to the 
following increasing phenomenon. 

Theorem 5. If $r_i = \arg\min_{r_{i-1} < r < T} f_i(r | r_1, \ldots, r_{i-1})$, $r_{i-1} = \arg\min_{r_{i-2} < r < T} f_{i-1}(r | r_1, \ldots, r_{i-2})$, $r_{i+1} = \arg\min_{r_i < r < T} f_{i+1}(r | r_1, \ldots, r_i)$, and $2r_i - r_{i-1} < T$, then it holds that:

$$r_i - r_{i-1} < r_{i+1} - r_i, \tag{52}$$

$$r_{i+1} > 2r_i - r_{i-1}. \tag{53}$$

We define $r_0 = 0$, so we have:

$$r_2 > 2r_1. \tag{54}$$

Proof. This theorem can be proved by using Theorem 2 and the properties of $p_i(r | r_{i-1})$ and $p_{i+1}(r | r_i)$.

The curves of $p_{i+1}(r | r_i)$ with $r > r_i$ and $p_i(r | r_{i-1})$ with $r > r_{i-1}$ have similar shapes according to Fig. 2. The difference between $p_i(r | r_{i-1})$ and $p_{i+1}(r | r_i)$ can be characterized by their saturation points $r_i^* = \arg\min_{r} \{ r \mid 1 - p(r | r_{i-1}) < \epsilon \}$ and $r_{i+1}^* = \arg\min_{r} \{ r \mid 1 - p(r | r_i) < \epsilon \}$. Define the increasing steps $\Delta r_i^* = r_i^* - r_{i-1}$ and $\Delta r_{i+1}^* = r_{i+1}^* - r_i$; then $r_i^* = r_{i-1} + \Delta r_i^*$ and $r_{i+1}^* = r_i + \Delta r_{i+1}^*$ can be rewritten as:

$$\Delta r_i^* = \arg\min_{\Delta r} \{ \Delta r \mid 1 - p(r_{i-1} + \Delta r | r_{i-1}) < \epsilon \}, \tag{55}$$

$$\Delta r_{i+1}^* = \arg\min_{\Delta r} \{ \Delta r \mid 1 - p(r_i + \Delta r | r_i) < \epsilon \}. \tag{56}$$

Now investigate the curves of $p(r_{i-1} + \Delta r | r_{i-1})$ and $p(r_i + \Delta r | r_i)$ (see Fig. 6). According to the property (42) of $p(r_i | r_{i-1})$, $\Delta r_{i+1}^* > \Delta r_i^*$ holds because $r_i > r_{i-1}$. □

Fig. 6. $p(r_{i-1} + \Delta r | r_{i-1})$ and $p(r_i + \Delta r | r_i)$.

Fig. 7. Illustration of Theorem 5.

Fig. 7 illustrates the nature of Theorem 5, where the objective functions and the estimated optimal solutions at the saturation points are shown. The relationship between $r_{i-1}$, $r_i$, and $r_{i+1}$ is $r_{i-1} - r_{i-2} < r_i - r_{i-1} < r_{i+1} - r_i$ (i.e., $\Delta r_{i-1} < \Delta r_i < \Delta r_{i+1}$).

According to Theorem 5, if $r_i = \arg\min_{r_{i-1} < r < T} f_i(r | r_1, \ldots, r_{i-1})$ and $r_{i-1} = \arg\min_{r_{i-2} < r < T} f_{i-1}(r | r_1, \ldots, r_{i-2})$ are already known, then the search range for $r_{i+1}$ is reduced to $[r_i + (r_i - r_{i-1}), T]$ (i.e., $[2r_i - r_{i-1}, T]$), where $r_i - r_{i-1}$ is called the increasing step.

The training process is shown in Fig. 8, where the increasing phenomenon is used for efficient minimization.

V. Proposed Method: Joint-Minimum Based 
Multi-Stage Cascade 

A. Existence and Uniqueness of a Jointly Optimal Solution 

The method in Section IV is a greedy optimization algo¬ 
rithm because it seeks an optimal r t on the condition that 
(ri,... ,rj_i) are known and fixed. The objective function 
is /j(rjri, ...,rj_i), i = 1 In this section, we give 

an algorithm for jointly seeking the optimal (ri,...,rs) that 
globally minimizes the objective function f(r\,rs ) instead 





























Fig. 8. Local-minimum based multi-stage cascade learning. 

of fs(r |ri, rs-i)- That is, the goal of joint optimization is 
to find (r*, ...,r* s ) = arg min fs{n, ...,r s ) 

r\,...,rs 

For the sake of clarity, we start with establishing a globally 
optimal two-stage cascade structure. The globally optimal 
cascade structure with more than two stages will be extended 
from the two-stage one. 

The goal of jointly optimal two-stage cascade learning is to find $(r_1^*, r_2^*) = \arg\min_{r_1, r_2} f_2(r_1, r_2)$.

Obviously, if both $f_2(r \mid r_1) = f_2(r_1, r)$ and $f_2(r \mid r_2) = f_2(r, r_2)$ have unique minimum solutions, then $f_2(r_1, r_2)$ has a unique minimum solution. $f_2(r \mid r_1)$ denotes the objective function of a two-stage cascade where the parameter $r_1$ of stage 1 is known and the parameter $r$ of stage 2 is an unknown variable. $f_2(r \mid r_2)$ stands for the situation where the parameter $r_2$ of stage 2 is known and the parameter $r$ of stage 1 is a variable. The theorems related to the joint optimization are as follows.

Theorem 6. $\min_r f_2(r, r_2) = \min_r f_2(r \mid r_2)$ has a unique minimum solution $r_1$.

Proof. We first prove the existence of the minimum solution and then give the evidence of the uniqueness of the minimum solution.

Existence:

Since

$$f_2(r \mid r_2) = p_1(r)(r + c) + (1 - p_1(r))\,p_2(r_2 \mid r)(r_2 + 2c) + (1 - p_1(r))(1 - p_2(r_2 \mid r))(T + 3c),$$

we have

$$f_2'(r \mid r_2) = p_1'(r)(r - 2c - T) + p_1(r) + (T + c - r_2)\,p_1'(r)\,p_2(r_2 \mid r) - (T + c - r_2)(1 - p_1(r))\,p_2'(r_2 \mid r).$$

The sum of the rejected negative sub-windows of stage 1 and stage 2 is a constant $\rho > 0$ once $r_2$ is fixed:

$$p_1(r) + (1 - p_1(r))\,p_2(r_2 \mid r) = p_1(r_2) = \rho. \tag{57}$$

Differentiating both sides of (57) with respect to $r$ yields:

$$p_1'(r)\,p_2(r_2 \mid r) - (1 - p_1(r))\,p_2'(r_2 \mid r) = p_1'(r). \tag{58}$$

Therefore, $f_2'(r \mid r_2)$ can be written as

$$f_2'(r \mid r_2) = p_1'(r)(r - r_2 - c) + p_1(r). \tag{59}$$

Since $\lim_{r \to 0} p_1(r) = 0$ (see (19b)) and $p_1'(r) > 0$ (see (13)),

$$\lim_{r \to 0} f_2'(r \mid r_2) = -(r_2 + c)\,p_1'(0) < 0,$$

$$\lim_{r \to r_2 + c} f_2'(r \mid r_2) = p_1(r_2 + c) > 0.$$

Since $\lim_{r \to r_2 + c} f_2'(r \mid r_2) > 0$, $\lim_{r \to 0} f_2'(r \mid r_2) < 0$, and $f_2'(r \mid r_2)$ is a continuous function, there must exist an $r_1 \in [1, r_2 + c)$ satisfying $f_2'(r_1 \mid r_2) = 0$ and $r_1 = \arg\min_r f_2(r \mid r_2)$.

Uniqueness (Proof by contradiction):

Suppose there are two local minimum solutions $\hat r_1$ and $r_1$ with $\hat r_1 < r_1$. Then it holds that $f_2'(\hat r_1 \mid r_2) - f_2'(r_1 \mid r_2) = 0$.

Now investigate the value of

$$f_2'(\hat r_1 \mid r_2) - f_2'(r_1 \mid r_2) = [p_1'(\hat r_1)(\hat r_1 - r_2 - c) - p_1'(r_1)(r_1 - r_2 - c)] + [p_1(\hat r_1) - p_1(r_1)]$$

if $\hat r_1 < r_1$ is true.

Since $\hat r_1 < r_1$, we have $p_1'(\hat r_1) > p_1'(r_1) > 0$, $\hat r_1 - r_2 - c < r_1 - r_2 - c < 0$, and $0 < p_1(\hat r_1) < p_1(r_1)$. Therefore, $f_2'(\hat r_1 \mid r_2) - f_2'(r_1 \mid r_2) < 0$.

This contradicts $f_2'(\hat r_1 \mid r_2) - f_2'(r_1 \mid r_2) = 0$. Therefore, $\hat r_1 < r_1$ is wrong.

Similarly, we can prove that $\hat r_1 > r_1$ is wrong. Consequently, $\hat r_1 = r_1$ is true, which means there is a unique solution in $[1, r_2 + c)$. Because $c$ is smaller than 1 (see the statement below (13)), this is equivalent to the unique solution $r_1$ lying in the range $[1, r_2)$. □
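The closed form (59) of the derivative can be checked numerically. The sketch below is only an illustration under stated assumptions: the exponential model for $p_1(r)$, the values of `T`, `c` and `r2`, and all function names are hypothetical choices, and $p_2(r_2 \mid r)$ is obtained from the constraint (57).

```python
import math

# Assumed toy setting (not from the paper): T weak classifiers,
# per-stage overhead c, and a fixed stage-2 partition point r2.
T, c, r2 = 200.0, 0.5, 120.0


def p1(r):
    # Assumed increasing, concave rejection-rate model with p1(0) = 0.
    return 1.0 - math.exp(-r / 30.0)


def dp1(r):
    return math.exp(-r / 30.0) / 30.0


rho = p1(r2)  # total rejection of stages 1 and 2, constant once r2 is fixed


def p2(r):
    # Conditional rejection rate of stage 2, solved from constraint (57).
    return (rho - p1(r)) / (1.0 - p1(r))


def f2(r):
    # Two-stage cost f_2(r | r2) as written in the existence proof.
    a, b = p1(r), p2(r)
    return (a * (r + c)
            + (1 - a) * b * (r2 + 2 * c)
            + (1 - a) * (1 - b) * (T + 3 * c))


def f2_prime_closed(r):
    # Closed form (59): p1'(r) (r - r2 - c) + p1(r).
    return dp1(r) * (r - r2 - c) + p1(r)


def f2_prime_numeric(r, h=1e-5):
    # Central finite difference of f2.
    return (f2(r + h) - f2(r - h)) / (2 * h)


for r in (5.0, 20.0, 60.0, 100.0):
    print(r, f2_prime_closed(r), f2_prime_numeric(r))
```

Under this model the derivative is negative near $r = 0$ and positive at $r = r_2 + c$, which is the sign change used in the existence argument.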

Theorem 6 states that if the information of stage 2 is given, then one can find an optimal parameter $r$ for stage 1 so that the computation cost $f_2$ of the final two-stage cascade is minimized.

Theorem 7. $f_2(r_1, r_2)$ has a unique minimum solution $(r_1^*, r_2^*)$.

Proof. Because both $\min_r f_2(r \mid r_1) = \min_r f_2(r_1, r)$ (see the corresponding theorem in Section IV) and $\min_r f_2(r \mid r_2) = \min_r f_2(r, r_2)$ (see Theorem 6) have unique minimum solutions, $\min f_2(r_1, r_2)$ has a unique minimum solution. That is,

$$f_2(r_1^*, r_2^*) = \min_{r_1, r_2} f_2(r_1, r_2) = \min_{r_1} \min_{r_2} f_2(r_2 \mid r_1) = \min_{r_2} \min_{r_1} f_2(r_1 \mid r_2).$$

□

It is straightforward to generalize Theorem 7 to the following theorem:

Theorem 8. $f_S(r_1, \ldots, r_S)$ has a unique minimum solution $(r_1^*, \ldots, r_S^*)$.

B. How to Search the Jointly Optimal Solutions 

1) Algorithm: Theorem 8 guarantees the existence and uniqueness of the jointly optimal solution for the stages of a cascade. In this section, we give algorithms (i.e., Algorithms 2 and 3) for searching the solution and then justify the algorithms theoretically. We start with the algorithm for optimizing a two-stage cascade and then generalize it to the multi-stage one.

Algorithm 2 Globally optimal two-stage cascade learning.

Input:
   Strong classifier $H(\mathbf{x}) = \sum_{i=1}^{T} \alpha_i h_i(\mathbf{x})$ and its threshold $t$;
   A set of true negative sub-windows $\{\mathbf{x} \mid l(\mathbf{x}) = -1\}$;
Output:
   $(r_1^*, r_2^*) = \arg\min_{r_1, r_2} f_2(r_1, r_2)$;

1: Initialization
2: Search the optimal solution $r_1$ of $f_1(r)$ for stage 1 in the range $1 < r < T$: $r_1 = \arg\min_{1 < r < T} f_1(r)$.
3: Given $r_1$, search the optimal solution $r_2$ of $f_2(r \mid r_1)$ for stage 2 in the range $[2r_1, T)$: $r_2 = \arg\min_{2r_1 \le r < T} f_2(r \mid r_1)$. See Theorem 5 for the reason of $r \ge 2r_1$.
4: Iteration (repeat)
5: Given $r_2$, search the optimal solution $\hat r_1$ of $f_2(r \mid r_2)$ in the range $1 < r < r_1$ for stage 1: $\hat r_1 = \arg\min_{1 < r < r_1} f_2(r \mid r_2)$. Note that $\hat r_1 < r_1$ (see Theorem 9). An efficient search strategy is decreasing $r$ from $r_1$ step by step until $f_2(r \mid r_2)$ does not decrease. $f \leftarrow f_2(\hat r_1 \mid r_2)$.
6: Given $\hat r_1$, search the optimal solution $\hat r_2$ of $f_2(r \mid \hat r_1)$ for stage 2 in the range $\hat r_1 < r < r_2$: $\hat r_2 = \arg\min_{\hat r_1 < r < r_2} f_2(r \mid \hat r_1)$. Note that $\hat r_2 < r_2$ (see Theorem 12). An efficient search strategy is decreasing $r$ from $r_2$ step by step until $f_2(r \mid \hat r_1)$ does not decrease. $\hat f \leftarrow f_2(\hat r_2 \mid \hat r_1)$.
7: Update $r_1 \leftarrow \hat r_1$, $r_2 \leftarrow \hat r_2$.
8: until $f - \hat f \le \eta$
9: return $r_1^* \leftarrow \hat r_1$, $r_2^* \leftarrow \hat r_2$.

The task of jointly optimizing a two-stage cascade can be expressed as $(r_1^*, r_2^*) = \arg\min_{r_1, r_2} f_2(r_1, r_2)$. The idea of our optimization method is shown in Algorithm 2.

The proposed Algorithm 2 is an alternating optimization procedure. In the initialization step, the solution $r_1$ of the one-stage cascade learning is searched in the largest range $1 < r < T$: $r_1 = \arg\min_r f_1(r)$. The value of $r_1$ is shown in Fig. 9, where "#1" means that $r_1$ is obtained first. The obtained $r_1$ is used as the upper bound of the search range for the better solution $\hat r_1$ in line 5 of Algorithm 2. After $r_1$ is given, line 3 of Algorithm 2 searches the optimal solution $r_2$ of $f_2(r \mid r_1)$ for stage 2 in the range $2r_1 \le r < T$: $r_2 = \arg\min_{2r_1 \le r < T} f_2(r \mid r_1)$. Based on (54), the search range starts from $2r_1$. The value of $r_2$ is shown in Fig. 9, where "#2" means that $r_2$ is the second value obtained by Algorithm 2.

In line 5 of Algorithm 2, $r_2$ is given and the task is to search the optimal solution $\hat r_1$ of $f_2(r \mid r_2)$ in the range $1 < r < r_1$ for stage 1: $\hat r_1 = \arg\min_{1 < r < r_1} f_2(r \mid r_2)$. Because $r_1 \ll T$, the search range $1 < r < r_1$ is much smaller than the one (i.e., $1 < r < T$) in line 2. Theorem 9 guarantees $\hat r_1 < r_1$ for the first round of iteration. $\hat r_1$ is the third value obtained by Algorithm 2, which is shown near "#3" in Fig. 9. Experimental results and intuitive analysis show that the absolute distance $|\hat r_1 - r_1|$ from $\hat r_1$ to $r_1$ is much smaller than the absolute distance $|1 - \hat r_1|$ from 1 to $\hat r_1$, so the search strategy of decreasing $r$ from $r_1$ step by step until $f_2(r \mid r_2)$ does not decrease is more efficient than the one of increasing $r$ from 1 step by step until $f_2(r \mid r_2)$ does not increase.

Fig. 9. Illustration of the intermediate values obtained by Algorithm 2. $r_1$ and $r_2$ are the outputs of Initialization. The stage parameters are updated in the following order: $r_1 \to r_2 \to \hat r_1 \to \hat r_2 \to \tilde r_1 \to \tilde r_2$ with $\tilde r_1 < \hat r_1 < r_1$ and $\tilde r_2 < \hat r_2 < r_2$.

In line 6 of Algorithm 2, $\hat r_1$ is given and the task is to search the optimal solution $\hat r_2$ of $f_2(r \mid \hat r_1)$ in the range $\hat r_1 < r < r_2$ for stage 2: $\hat r_2 = \arg\min_{\hat r_1 < r < r_2} f_2(r \mid \hat r_1)$. Because $r_2 < T$, the upper bound of the search range is much smaller than the one (i.e., $T$) in line 3. Moreover, as the iteration runs, the updated $r_2$ becomes smaller and so the upper bound of the search range for $r_2$ becomes smaller too. Theorem 12 guarantees $\hat r_2 < r_2$. The value of $\hat r_2$ is shown in Fig. 9 as "#4" obtained by Algorithm 2. Experimental results and intuitive analysis show that the absolute distance $|\hat r_2 - r_2|$ from $\hat r_2$ to $r_2$ is much smaller than the absolute distance $|\hat r_1 - \hat r_2|$ from $\hat r_1$ to $\hat r_2$, so the search strategy of decreasing $r$ from $r_2$ step by step until $f_2(r \mid \hat r_1)$ does not decrease is more efficient than the one of increasing $r$ from $\hat r_1$ step by step until $f_2(r \mid \hat r_1)$ does not increase.

In the second round of iteration, because $\hat r_2 < r_2$, the parameter value $\tilde r_1$ for stage 1 is obtained and shown in Fig. 9 with the label "#5". According to Theorem 11, it is true that $\tilde r_1 < \hat r_1$. Subsequently, the parameter value $\tilde r_2$ for stage 2 is obtained and shown in Fig. 9 with the label "#6". According to Theorem 12, it is true that $\tilde r_2 < \hat r_2$.

The iteration stops if the difference between the value $f$ of the objective function in line 5 of Algorithm 2 and the value $\hat f$ in line 6 of Algorithm 2 is equal to or smaller than the threshold $\eta > 0$.

Decreasing Phenomenon: Fig. 9 shows an interesting phenomenon: (1) Once a new stage 2 is added, the parameter $r_1$ of stage 1 should be updated by decreasing $r_1$ to a smaller number $\hat r_1$ so that the computation cost is minimized. (2) Once the number of stages is fixed, the parameter for each stage decreases gradually as iteration proceeds.
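The alternating procedure of Algorithm 2 can be sketched on synthetic data. This is a minimal illustration, not the authors' implementation. It assumes that each negative sub-window has a rejection index `k`, the smallest prefix length of weak classifiers that rejects it, so that a stage with `r >= k` rejects it. It also replaces the step-by-step descent with an exhaustive search inside the reduced ranges and starts the stage-2 search at `r1 + 1` instead of `2 * r1`. All names are hypothetical.

```python
from bisect import bisect_right


def cascade_cost(parts, ks_sorted, T, c):
    """Average number of evaluated features per negative sub-window.

    parts:     increasing partition points (r_1, ..., r_S)
    ks_sorted: sorted rejection indices of the negative sub-windows
    """
    n = len(ks_sorted)
    total, prev_rejected = 0.0, 0
    for i, r in enumerate(parts, start=1):
        rejected = bisect_right(ks_sorted, r)           # negatives with k <= r
        total += (rejected - prev_rejected) * (r + i * c)
        prev_rejected = rejected
    # Survivors are evaluated by the full strong classifier.
    total += (n - prev_rejected) * (T + (len(parts) + 1) * c)
    return total / n


def two_stage_alternating(ks_sorted, T, c, eta=1e-9, max_rounds=100):
    f2 = lambda a, b: cascade_cost([a, b], ks_sorted, T, c)
    # Initialization: one-stage optimum, then greedy stage 2.
    r1 = min(range(1, T), key=lambda r: cascade_cost([r], ks_sorted, T, c))
    r2 = min(range(r1 + 1, T), key=lambda r: f2(r1, r))
    best = f2(r1, r2)
    for _ in range(max_rounds):
        # Decreasing phenomenon: search only at or below the current values.
        r1_new = min(range(1, r1 + 1), key=lambda r: f2(r, r2))
        r2_new = min(range(r1_new + 1, r2 + 1), key=lambda r: f2(r1_new, r))
        cost = f2(r1_new, r2_new)
        improvement = best - cost
        r1, r2, best = r1_new, r2_new, cost
        if improvement <= eta:
            break
    return r1, r2, best


# Toy data: rejection indices spread uniformly over 1..50, four of each.
ks = sorted((i * 7) % 50 + 1 for i in range(200))
print(two_stage_alternating(ks, T=100, c=0.5))
```

Because the current value of each parameter is always inside its own search range, the cost never increases from one round to the next in this sketch.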
2) Justification of the Algorithm: Theorems 9-12 are given to theoretically interpret the so-called Decreasing Phenomenon and justify Algorithm 2. Theorem 9 implies that the parameter $r_1$ of stage 1 should be updated by decreasing it to a smaller number when the parameter $r_2$ of stage 2 is fixed.


Theorem 9. $\hat r_1 = \arg\min_r f_2(r \mid r_2) < \arg\min_r f_1(r) = r_1$, where $r_2 > r_1$.


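Proof. Since $\hat r_1 = \arg\min_r f_2(r \mid r_2)$ and $r_1 = \arg\min_r f_1(r)$,

$$\left.\frac{d f_2(r \mid r_2)}{dr}\right|_{r = \hat r_1} = \left.\frac{d f_1(r)}{dr}\right|_{r = r_1} = 0.$$

Since $f_1(r) = p_1(r)(r + c) + (1 - p_1(r))(T + 2c)$,

$$f_1'(r_1) = p_1(r_1) + p_1'(r_1)(r_1 - T - c).$$

Since $f_2(r \mid r_2) = p_1(r)(r + c) + (1 - p_1(r))[p_2(r_2 \mid r)(r_2 + 2c) + (1 - p_2(r_2 \mid r))(T + 3c)]$,

$$f_2'(r \mid r_2) = p_1'(r)(r - r_2 - c) + p_1(r) \quad \text{(see (59))}.$$

Now investigate the value of

$$f_2'(\hat r_1 \mid r_2) - f_1'(r_1) = p_1'(\hat r_1)(\hat r_1 - r_2 - c) - p_1'(r_1)(r_1 - T - c) + (p_1(\hat r_1) - p_1(r_1))$$

if $\hat r_1 > r_1$ is true.

Since $\hat r_1 > r_1$ is assumed, $0 < p_1'(\hat r_1) < p_1'(r_1)$, $0 > \hat r_1 - r_2 - c > r_1 - T - c$, and $p_1(\hat r_1) > p_1(r_1) > 0$. Therefore $p_1'(\hat r_1)(\hat r_1 - r_2 - c) > p_1'(r_1)(r_1 - T - c)$ and $p_1(\hat r_1) > p_1(r_1)$, so $f_2'(\hat r_1 \mid r_2) - f_1'(r_1) > 0$.

This contradicts $f_2'(\hat r_1 \mid r_2) - f_1'(r_1) = 0$. So $\hat r_1 > r_1$ is wrong and $\hat r_1 < r_1$ is true. □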

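As a generalization of Theorem 9, we have the following theorem:

Theorem 10. If $(r_1^{*(i)}, \ldots, r_i^{*(i)}) = \arg\min_{r_1, \ldots, r_i} f_i(r_1, \ldots, r_i)$ and $(r_1^{*(i+1)}, \ldots, r_i^{*(i+1)}, r_{i+1}^{*(i+1)}) = \arg\min_{r_1, \ldots, r_{i+1}} f_{i+1}(r_1, \ldots, r_i, r_{i+1})$, then $r_j^{*(i+1)} \le r_j^{*(i)}$, $j = 1, \ldots, i$.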


As a generalized version of Theorem 9, Theorem 10 states that once a new stage $i + 1$ is added, all the optimal parameters of the existing stages $1, \ldots, i$ should be updated and decreased so that the computation cost is minimized.


Theorem 11. If $\hat r_2 < r_2$, then $\hat r_1 = \arg\min_r f_2(r \mid \hat r_2) < \arg\min_r f_2(r \mid r_2) = r_1$.

r 


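Proof. Since $\hat r_1 = \arg\min_r f_2(r \mid \hat r_2)$ and $r_1 = \arg\min_r f_2(r \mid r_2)$,

$$\left.\frac{d f_2(r \mid \hat r_2)}{dr}\right|_{r = \hat r_1} = \left.\frac{d f_2(r \mid r_2)}{dr}\right|_{r = r_1} = 0$$

is true.

Now investigate the value of

$$f_2'(\hat r_1 \mid \hat r_2) - f_2'(r_1 \mid r_2) = [p_1'(\hat r_1)(\hat r_1 - \hat r_2 - c) - p_1'(r_1)(r_1 - r_2 - c)] + [p_1(\hat r_1) - p_1(r_1)]$$

if $\hat r_1 > r_1$ is true.

Since $\hat r_1 > r_1$ is assumed, $0 < p_1'(\hat r_1) < p_1'(r_1)$, $0 > \hat r_1 - \hat r_2 - c > r_1 - r_2 - c$, and $p_1(\hat r_1) > p_1(r_1) > 0$. Therefore $p_1'(\hat r_1)(\hat r_1 - \hat r_2 - c) > p_1'(r_1)(r_1 - r_2 - c)$ and $p_1(\hat r_1) > p_1(r_1)$, so $f_2'(\hat r_1 \mid \hat r_2) - f_2'(r_1 \mid r_2) > 0$.

This contradicts $f_2'(\hat r_1 \mid \hat r_2) - f_2'(r_1 \mid r_2) = 0$. So $\hat r_1 > r_1$ is wrong and $\hat r_1 < r_1$ is true. □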


Theorem 12. If $\hat r_1 < r_1$, then $\hat r_2 = \arg\min_{\hat r_1 < r < T} f_2(r \mid \hat r_1) < \arg\min_{r_1 < r < T} f_2(r \mid r_1) = r_2$.

Algorithm 3 Globally optimal multi-stage cascade learning.

Input:
   Strong classifier $H(\mathbf{x}) = \sum_{i=1}^{T} \alpha_i h_i(\mathbf{x})$ and its threshold $t$;
   A set of true negative sub-windows $\{\mathbf{x} \mid l(\mathbf{x}) = -1\}$;
Output:
   $(r_1^*, \ldots, r_S^*) = \arg\min_{r_1, \ldots, r_S} f_S(r_1, \ldots, r_S)$, where $S$ is the number of stages in the final cascade structure;

1: Search the optimal solution $r_1^{*(1)}$ of $f_1(r)$ for stage 1 in the range $1 < r < T$: $r_1^{*(1)} = \arg\min_{1 < r < T} f_1(r)$. $f \leftarrow f_1(r_1^{*(1)})$;
2: for $i = 2$ to $S$ do
3:    Initialize the upper bounds $r_j^u$ of $r_1, \ldots, r_{i-1}$: $r_j^u \leftarrow r_j^{*(i-1)}$ for $j = 1, \ldots, i - 1$;
4:    Initialize the upper bound $r_i^u$ of $r_i$ by finding $r_i^u = \arg\min_{r_i} f_i(r_i \mid r_1^{*(i-1)}, \ldots, r_{i-1}^{*(i-1)})$ with the search range $r_i \ge 2 r_{i-1}^{*(i-1)} - r_{i-2}^{*(i-1)}$ and $r_0 = 0$.
6:    while $f - \hat f > \epsilon$ do
7:       $f \leftarrow \hat f$;
8:       for $j = 1$ to $i$ do
9:          $r_j^* = \arg\min_{r_j \le r_j^u} f_i(r_j \mid r_k^u, k \ne j)$;
10:      end for
11:      $\hat f \leftarrow f_i(r_1^*, \ldots, r_i^*)$, $r_j^u \leftarrow r_j^*$.
12:   end while
13:   $r_j^{*(i)} \leftarrow r_j^*$, $j = 1, \ldots, i$;
14: end for
15: return $r_i^* \leftarrow r_i^{*(S)}$, $i = 1, \ldots, S$.


Proof of Theorem 12. See Appendix A. □

Theorems 9, 11 and 12 can be extended to the multi-stage cascade. Correspondingly, the Decreasing Phenomenon can be generalized to the Generalized Decreasing Phenomenon and Algorithm 2 can be generalized to Algorithm 3.

Generalized Decreasing Phenomenon: If the alternating optimization Algorithm 3 is used to find the globally optimal solution $(r_1^{*(i)}, r_2^{*(i)}, \ldots, r_i^{*(i)}) = \arg\min_{r_1, \ldots, r_i} f_i(r_1, \ldots, r_i)$, then it holds that:

(1) Once a new stage $i + 1$ is added, all the optimal parameters of the existing stages $1, \ldots, i$ are updated and decreased so that the computation cost is minimized.

(2) Once the number of stages is fixed, the parameter for each stage decreases gradually as iteration proceeds.

In Algorithm 3, $\hat f$ is the objective function after a new stage $i$ is added while $f$ is the one before stage $i$ is added. That is, $f$ is the value of the objective function when there are $i - 1$ stages. According to Theorem 5, when a new stage $i$ is to be added, the optimal solution $r_i$ can be searched by increasing $r_i$ from $2r_{i-1}^{*(i-1)} - r_{i-2}^{*(i-1)}$ instead of $r_{i-1}^{*(i-1)}$. Because $2r_{i-1}^{*(i-1)} - r_{i-2}^{*(i-1)}$ is much larger than $r_{i-1}^{*(i-1)}$, the search efficiency is very high. The while loop of Algorithm 3 stops if the difference between $f$ and $\hat f$ is below a threshold $\epsilon > 0$, which implies that the algorithm arrives at the global minimum solution for $i$ stages.
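The stage-by-stage structure of Algorithm 3 can be sketched as coordinate descent over the partition points. This is an illustration under stated assumptions rather than the authors' code. Each negative sub-window is assumed to have a rejection index `k` (any stage with `r >= k` rejects it), the new stage is initialized by a plain search above the previous stage instead of from $2r_{i-1} - r_{i-2}$, and each coordinate is searched exhaustively at or below its current value. All names are hypothetical.

```python
from bisect import bisect_right


def cost(parts, ks, T, c):
    """Average features evaluated per negative for partition points `parts`."""
    n, total, prev = len(ks), 0.0, 0
    for i, r in enumerate(parts, start=1):
        rej = bisect_right(ks, r)                 # negatives with k <= r
        total += (rej - prev) * (r + i * c)
        prev = rej
    return (total + (n - prev) * (T + (len(parts) + 1) * c)) / n


def joint_cascade(ks, T, c, S, eps=1e-9, max_rounds=100):
    """Grow the cascade to S stages, re-optimizing all stages after each add."""
    parts = [min(range(1, T - S + 1), key=lambda r: cost([r], ks, T, c))]
    for i in range(2, S + 1):
        # Add stage i greedily, keeping room for the remaining stages.
        lo = parts[-1] + 1
        parts.append(min(range(lo, T - S + i),
                         key=lambda r: cost(parts + [r], ks, T, c)))
        best = cost(parts, ks, T, c)
        for _ in range(max_rounds):
            for j in range(i):
                lower = parts[j - 1] + 1 if j > 0 else 1
                upper = parts[j]                  # current value is the upper bound
                parts[j] = min(
                    range(lower, upper + 1),
                    key=lambda r: cost(parts[:j] + [r] + parts[j + 1:], ks, T, c))
            new_cost = cost(parts, ks, T, c)
            converged = best - new_cost <= eps
            best = new_cost
            if converged:
                break
    return parts, cost(parts, ks, T, c)


ks = sorted((i * 7) % 50 + 1 for i in range(200))
for S in (1, 2, 3):
    print(S, joint_cascade(ks, T=100, c=0.5, S=S))
```

On this toy data the partition points of the existing stages shrink each time a stage is added, which mirrors the generalized decreasing phenomenon described above.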

Fig. 10. The classification procedure of the multi-stage iCascade algorithm.

Fig. 10 shows the classification procedure of multi-stage iCascade, where the partition points $(r_1, \ldots, r_S)$ are given by Algorithm 3. If the computation cost of classifying positive samples is neglected, the computation cost $f_S$ of iCascade can be estimated by

$$f_S = \sum_{i=1}^{S} (r_i + ic) \left[ \prod_{j=1}^{i} \big(1 - p_{j-1}(r_{j-1})\big) \right] p_i(r_i) + \big(T + (S + 1)c\big) \prod_{j=1}^{S+1} \big(1 - p_{j-1}(r_{j-1})\big). \tag{60}$$
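Equation (60) can be evaluated with a single pass over the stages. The sketch below assumes the convention $p_0 = 0$ (every sub-window reaches stage 1), which is needed for (60) to reduce to the two-stage cost used in Theorem 6; the function name and the numbers in the example are hypothetical.

```python
def cascade_cost(r, p, T, c):
    """Estimated cost f_S of (60).

    r: partition points [r_1, ..., r_S]
    p: conditional rejection rates [p_1, ..., p_S] of the stages
    T: number of weak classifiers in the strong classifier
    c: per-stage overhead
    """
    S = len(r)
    survive = 1.0          # fraction of negatives reaching the current stage
    total = 0.0
    for i in range(S):
        total += (r[i] + (i + 1) * c) * survive * p[i]
        survive *= 1.0 - p[i]
    # Negatives passing all S stages are evaluated by the full classifier.
    return total + (T + (S + 1) * c) * survive


# Two stages, each rejecting half of what it sees:
# 0.5 * 11 + 0.25 * 32 + 0.25 * 103
print(cascade_cost([10, 30], [0.5, 0.5], T=100, c=1))
```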


C. Threshold learning in iCascade 

Once the number of weak classifiers in each stage is determined by Algorithm 3, the parameters affecting the computation cost are the thresholds $t_i$, $i = 1, \ldots, S$. In this section, we give a theorem and an algorithm for setting the thresholds $(t_1, \ldots, t_S)$.

Theorem 13 states that the computation cost $f_S$ monotonically decreases with $t_i$ and $p_i(t_i)$, $i = 1, \ldots, S$. So the computation cost can be reduced by decreasing the thresholds under the constraint of a minimum acceptable detection rate.

Theorem 13. $f_S$ monotonically decreases with $t_i$ and $p_i(t_i)$, $i = 1, \ldots, S$.

Proof. See Appendix B. □

If the detection rate $D = 1$ (i.e., all the positive training samples are correctly classified) is the constraint, then the optimal threshold $t_i^*$ can be expressed as:

$$t_i^* = \arg\min t_i, \quad \text{s.t. } d(t_i) = 1, \; i = 1, \ldots, S, \tag{61}$$

which guarantees $D = \prod_{i=1}^{S} d(t_i^*) = 1$. In (61), $d(t_i)$ is the detection rate of stage $i$ defined by:

$$d(t_i) = \frac{\sum_{\mathbf{x}} I\big(\sum_{j} \alpha_j h_j(\mathbf{x}) > t_i\big)}{\sum_{\mathbf{x}} I\big(l(\mathbf{x}) = 1\big)}. \tag{62}$$


It is challenging to choose the optimal thresholds if the expected detection rate $D < 1$. It is well known that the detection rate $D$ of the system is the product of the detection rates $d(t_i)$ of the stages. A popular way to set $d(t_i)$ is

$$d(t_i) = D^{1/S}, \quad i = 1, \ldots, S. \tag{63}$$

However, when the number of stages of iCascade is very large, it holds that $d(t_i) \approx 1$. Such a high $d(t_i)$ makes the threshold $t_i$ very large, and the corresponding computation cost is very large.
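A short numerical illustration shows how quickly the per-stage requirement of (63) approaches 1 as the number of stages grows. The function name and the chosen values of $D$ and $S$ are only examples.

```python
def per_stage_detection_rate(D, S):
    """Per-stage detection rate d = D**(1/S), so that d**S equals D."""
    return D ** (1.0 / S)


for S in (5, 20, 50):
    d = per_stage_detection_rate(0.98, S)
    print(S, round(d, 5))
```

With a target of $D = 0.98$, each of 50 stages would have to keep about 99.96% of the faces it sees, which leaves little room for lowering any single threshold.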

To deal with the above problem, we propose to use Algorithm 4 for threshold learning. The initial thresholds are chosen by (61), guaranteeing that the detection rate $D$ is 1. The corresponding initial computation cost is denoted by $f_S$. The main issue is to select which stage should have its initial threshold decreased by a small step $\Delta t_i$. In our algorithm, the derivative $f_S'$ of the computation cost $f_S$ with respect to the detection rate $D$ is computed by

$$f_S'(i) \approx \Delta f_S / \Delta D_i, \tag{64}$$

where $\Delta D_i$ is the variation of the system detection rate. Note that the variation $\Delta D_i$ is caused by changing $t_i$ to $t_i - \Delta t_i$ while the thresholds $t_k$ of the other stages (i.e., $k \ne i$) remain unchanged.

The stage $j$ with the largest derivative is selected and its threshold $t_j$ is then decreased by the small step $\Delta t_j$:

$$j = \arg\max_i \frac{\Delta f_S}{\Delta D_i}, \tag{65}$$

$$t_j \leftarrow t_j - \Delta t_j, \tag{66}$$

with the thresholds of the other stages (i.e., $i \ne j$) unchanged.

Re-compute the computation cost $f_S$ and detection rate $D$ when $t_j$ is updated:

$$f_S \leftarrow f_S - \Delta f_S, \tag{67}$$

$$D \leftarrow D - \Delta D_j. \tag{68}$$


The step $\Delta t_i$ is small enough that, when the iteration stops, the detection rate $D$ is only slightly smaller than the target detection rate $D_0$.

As shown in Algorithm 4, the iteration of choosing the most important stage $j = \arg\max_i \Delta f_S / \Delta D_i$, updating its threshold $t_j \leftarrow t_j - \Delta t_j$, and updating the corresponding computation cost $f_S \leftarrow f_S - \Delta f_S$ and detection rate $D \leftarrow D - \Delta D_j$ runs until the updated detection rate $D$ is below the expected detection rate $D_0$.
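The greedy loop described above can be sketched generically. This is a hedged illustration, not the authors' implementation: `cost` and `detect` are hypothetical callables that return the system cost $f_S$ and detection rate $D$ for a list of thresholds, and the toy models in the example are invented to make the sketch runnable.

```python
def learn_thresholds(t, cost, detect, d0, step=0.01, max_iter=10_000):
    """Greedily lower one threshold at a time, as in (64)-(68).

    t:      initial thresholds (chosen so that the detection rate is 1)
    cost:   callable, thresholds -> computation cost f_S
    detect: callable, thresholds -> system detection rate D
    d0:     expected detection rate D_0
    """
    t = list(t)
    f, d = cost(t), detect(t)
    for _ in range(max_iter):
        if d < d0:                       # stopping rule: D fell below D_0
            break
        best = None
        for i in range(len(t)):
            trial = t[:i] + [t[i] - step] + t[i + 1:]
            df = f - cost(trial)         # cost saved by lowering t_i
            dd = d - detect(trial)       # detection rate lost
            ratio = df / dd if dd > 0 else float("inf")
            if best is None or ratio > best[0]:
                best = (ratio, trial)
        t = best[1]
        f, d = cost(t), detect(t)
    return t, f, d


# Toy models (assumptions): stage 0 is ten times more expensive per unit of
# threshold than stage 1, and both stages lose detection rate equally fast.
toy_cost = lambda t: 10.0 * t[0] + 1.0 * t[1]
toy_detect = lambda t: min(1.0, 0.5 + 0.5 * t[0]) * min(1.0, 0.5 + 0.5 * t[1])

print(learn_thresholds([1.0, 1.0], toy_cost, toy_detect, d0=0.95))
```

In this toy setting every step is spent on stage 0, because lowering its threshold saves the most cost per unit of detection rate lost.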


Algorithm 4 Threshold learning algorithm for iCascade.

Input:
   Expected detection rate $D_0$;
   Positive and negative training samples;
   Strong classifier $H(\mathbf{x}) = \sum_{i=1}^{T} \alpha_i h_i(\mathbf{x})$;
Output:
   The optimal thresholds $t_i$ of all the $S$ stages;

1: Initialize the threshold $t_i$ of each stage by $t_i = \arg\min t_i$, s.t. $d(t_i) = 1$, $i = 1, \ldots, S$, so that the system detection rate $D = 1$;
2: Corresponding to the initial thresholds, the initial computation cost of the system is computed by (60) and denoted by $f_S$;
3: repeat
4:    For each stage, compute the approximation $\Delta f_S / \Delta D_i$ of the derivative of the computation cost $f_S$ with respect to the detection rate $D$. The variations $\Delta f_S$ and $\Delta D_i$ are caused by changing $t_i$ to $t_i - \Delta t_i$ while the thresholds $t_k$ of the other stages (i.e., $k \ne i$) remain unchanged;
5:    From all the $S$ stages, choose the stage $j$ with the largest derivative: $j = \arg\max_i \Delta f_S / \Delta D_i$. Then decrease the threshold $t_j$ of stage $j$ by a small step $\Delta t_j$: $t_j \leftarrow t_j - \Delta t_j$;
6:    Update the computation cost $f_S$ and detection rate $D$: $f_S \leftarrow f_S - \Delta f_S$, $D \leftarrow D - \Delta D_j$;
7: until $D < D_0$
8: return the updated thresholds $t_i$ of all the $S$ stages.

VI. EXPERIMENTAL RESULTS
A. Experimental Setup

The classical cascade learning algorithms of Viola and Jones (VJ) (a.k.a. Fixed Cascade) [8], Recycling Cascade [10], and Recycling & Retracting Cascade [10] are compared with the proposed iCascade algorithm. The testing dataset is the standard MIT-CMU frontal face database [33], [8]. The positive training dataset consists of about 20000 normalized face images of size 20x20 pixels. 5000 large non-face images are collected from web sites to generate the negative training dataset. Both the positive and negative training images can be downloaded from https://sites.google.com/site/yanweipang/publica.

In addition, the intermediate results demonstrating the cor¬ 
rectness of the proposed theorems are given in Section VI.B. 

A strong classifier $H(\mathbf{x}) = \sum_{i=1}^{T} \alpha_i h_i(\mathbf{x})$ is taken as the input of iCascade. The strong classifier is obtained by the standard AdaBoost algorithm without designing a cascade structure.

B. Intermediate Results of iCascade 

Some intermediate results are shown in this section. These 
results show the rationality of the assumptions and the cor¬ 
rectness of the proposed theorems. 

1) Local-Minimum Based Cascade: In Section III, the regular strong AdaBoost classifier is divided into $H_L(\mathbf{x}, r)$ and $H_R(\mathbf{x}, r)$ to reject some negative sub-windows earlier, and the key problem is to determine an optimal $r$ that minimizes the computation cost. To solve this problem, it is necessary to reveal the relationship between $r$ and the negative rejection rate $p$.

In this part, with the training dataset described in Section VI.A, we train a regular strong AdaBoost classifier and split it into two parts by $r$, which varies from 1 to $T$. In the case that the detection rate is fixed at 1, Fig. 11 shows that the negative rejection rate $p$ increases with $r$. $p$ first grows quickly from 0 to 0.96 when $r$ changes from 1 to a small value $r^* = 80$, and then becomes stable when $r$ is larger than $r^*$. Thus, we can model $p(r)$ by combining two linear functions: $p_1(r) = 0.012r$ with $r < r^*$ and $p_2(r) = 1$ with $r > r^*$. Fig. 11 demonstrates the rationality of (16)-(25). Fig. 12 shows that the computation cost $f$ first decreases and then increases with $r$, and the unique minimum is near $r^*$. Fig. 12 experimentally supports the correctness of Theorem 1 and Theorem 2.

Fig. 11. The negative rejection rate $p(r)$

Fig. 12. The computation cost $f(r)$
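The location of the one-stage minimum under a piecewise-linear rejection model can be reproduced in a few lines. The sketch uses the one-stage cost $f_1(r) = p(r)(r + c) + (1 - p(r))(T + 2c)$ from the proof of Theorem 9 and the slope 0.012 with saturation point 80 reported above. The plateau value 0.96, `T = 200` and `c = 0.5` are assumptions made only for this illustration.

```python
def p(r, r_star=80, slope=0.012):
    """Piecewise-linear rejection model: linear up to r_star, then flat."""
    return slope * min(r, r_star)


def f1(r, T, c):
    """One-stage cost: rejected windows cost r + c, the rest cost T + 2c."""
    return p(r) * (r + c) + (1.0 - p(r)) * (T + 2 * c)


def best_split(T, c):
    """Exhaustive search for the partition point minimizing f1."""
    return min(range(1, T), key=lambda r: f1(r, T, c))


print(best_split(T=200, c=0.5), f1(80, 200, 0.5))
```

With these assumed values the cost falls while the rejection rate is still rising and grows again once the rate has saturated, so the minimizer coincides with the saturation point.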

When we split the regular strong AdaBoost classifier into $H_L(\mathbf{x}, r_1)$ and $H_R(\mathbf{x}, r_1)$, the sub-windows not rejected by stage 1 are fed to stage 2. Then we can divide $H_R(\mathbf{x}, r_1)$ into two parts to form a 2-stage cascade. In this process, we should know some properties of the negative rejection rate of stage 2 (i.e., $p(r \mid r_1)$). Fig. 13 shows how $p(r \mid r_1)$ changes with $r$, where the curves of $p(r \mid r_1)$ for $r_1 = 10$ and $r_1 = 30$ are given, respectively. $p(r \mid r_1)$ has characteristics similar to those of $p(r)$. Fig. 14 shows how the derivative curves of $p(r \mid r_1)$ change with $r$. Obviously, when $\hat r_1 < r_1$, $p(r_2 \mid \hat r_1) > p(r_2 \mid r_1)$ and $p'(r_2 \mid r_1) < p'(r_2 \mid \hat r_1)$. Fig. 13 and Fig. 14 directly support the corresponding assumptions made earlier in the paper.

We use the local-minimum based multi-stage cascade learning method (see Fig. 4) to train an 8-stage cascade classifier. Table I shows how the computation cost $f$ changes with the number of stages. The computation cost first decreases quickly and then becomes stable. This phenomenon can be understood easily: the first few stages reject most of the sub-windows, so only a small part of the sub-windows arrive at the last few stages, which do not produce much computation cost.

Fig. 13. Some properties of $p(r_2 \mid r_1)$

Fig. 14. The derivative of $p(r_2 \mid r_1)$

2) Joint-Minimum Based Cascade: The local-minimum based multi-stage cascade (i.e., Fig. 4) seeks an optimal $r_i$ on the condition that $(r_1, \ldots, r_{i-1})$ are known and fixed, so $(r_1, \ldots, r_{i-1}, r_i)$ cannot be jointly optimal for minimizing the computation cost $f(r_1, \ldots, r_i)$, where not only $r_i$ but also $r_1, \ldots, r_{i-1}$ are variable. Thus, Algorithm 3 is proposed to train the joint-minimum based multi-stage cascade.

Fig. 15 shows the iteration process of Algorithm 3. The number 48 on the top blue line is $r_1^{*(1)} = \arg\min f_1(r)$, which is the result of line 1 of Algorithm 3. The rightmost number 172 on the top red line is $r_2^u = \arg\min f_2(r_2 \mid r_1^{*(1)}) = 172$ (see line 4 of Algorithm 3). Obviously $r_2^u = 172$ is the solution of local-minimum based optimization. The numbers 17 and 82 on the second blue line are solutions of joint-minimum based optimization (i.e., line 13 of Algorithm 3). Generally, the rightmost number on each red line is the upper bound $r_i^u$ of Algorithm 3, and the numbers on each blue line are the solutions $r_j^{*(i)}$, $j = 1, \ldots, i$, of joint-minimum based optimization. The generalized decreasing phenomenon can be found from Fig. 15. For example, $r_1$ decreases from 48 to 17, 12; $r_2$ decreases from 172 to 82, 45, 33; $r_3$ decreases from 180 to 172, 69, 60. Table II gives the computation cost of the cascade corresponding to Fig. 15. In Table II, $f_{local}(1) = 74.42$ is the computation cost $f_1(r_1)$ with $r_1 = 48$, and $f_{local}(2) = 52.87$ is equal to $f_2(r_2 \mid r_1)$ with $r_1 = 48$ and the local optimization solution $r_2 = 172$. Generally, $f_{local}(i)$ means the computation cost $f(r_i \mid r_1, \ldots, r_{i-1})$. In Table II, $f_{joint}(i)$ is the computation cost $f(r_1, \ldots, r_i)$ of the proposed joint-minimum algorithm, where $r_1, \ldots, r_i$ are all unknown and $i$ is the total number of stages of the cascade. Note that $f_{joint}(1) = f_{local}(1)$, because there is only one stage in the cascade. However, $f_{joint}(2) = 37.83$ and $f_{joint}(3) = 26.67$ are much less than $f_{local}(2) = 52.87$ and $f_{local}(3) = 31.24$, respectively.

Fig. 15. Generalized decreasing phenomenon in joint-minimum based multi-stage cascade

To compare the joint-minimum Algorithm 3 with the local-minimum algorithm (see Fig. 4), we visualize $f_{joint}$ in Table II and $f$ in Table I in Fig. 16. As the number of stages increases, the computation costs decrease. The difference is that the cost of the joint-minimum based algorithm decreases more quickly than that of the local-minimum algorithm. For example, when the numbers of stages are 3 and 8, the computation costs of the joint-minimum and local-minimum algorithms are (26.67 and 18.99) and (52.16 and 52.14), respectively. In summary, Fig. 16 demonstrates the advantage and importance of the proposed joint-minimum optimization algorithm.

3) Threshold learning: The thresholds $t_i$, $i = 1, \ldots, S$, affect the computation cost of iCascade. Algorithm 4 gives the iteration process to choose the threshold of each stage for iCascade. Note that the variation $\Delta D_i$ of the detection rate is obtained by changing $t_i$ to $t_i - \Delta t_i$. As $\Delta t_i$ gradually decreases, the detection accuracy increases whereas the training time grows drastically. A set of $\Delta t_i$ values is evaluated. We find that the performance is stably good if $\Delta t_i < 0.02$. As a tradeoff, $\Delta t_i = 0.01$ is empirically employed. Fig. 17 shows how the computation cost is updated in the iteration process for the thresholds of the first 20 stages. It can be seen that the computation cost decreases significantly with the iteration. In addition, Fig. 17 shows the convergence of the proposed threshold learning algorithm. Fig. 17 supports the correctness of Theorem 13.
TABLE I

Computation cost f of the algorithm in Fig. 4 versus the number s of stages

s    1      2      3      4      5      6      7      8
f    74.42  52.87  52.16  52.15  52.14  52.14  52.14  52.14

TABLE II

Computation cost of the cascade trained by Algorithm 3

stage number i   1      2      3      4      5      6      7      8      9
f_local(i)       74.42  52.87  31.24  26.00  22.16  21.99  21.26  21.19  18.98
f_joint(i)       74.42  37.83  26.67  22.70  22.06  21.31  21.20  18.99  18.38


Fig. 16. Comparison of the computation cost between local-minimum based 
multi-stage cascade and joint-minimum based one 



Fig. 17. The computation cost decreases as the thresholds $t_i$ are updated

C. Comparison With Other Algorithms 

In this section, we compare iCascade with some other algorithms, including Fixed Cascade [8], Recycling Cascade [10] and Recycling & Retracting Cascade [10].

Fixed Cascade is proposed by Viola and Jones. "Fixed" means that the detection rate $d_i$ and the false detection rate $f_i$ are the same for every stage and fixed. If the target detection rate of the cascade is $D$, the target false detection rate is $F$ and the number of stages is $N$, then $d_i = D^{1/N}$ and $f_i = F^{1/N}$. In Recycling Cascade, the score from the previous strong classifier stages serves as a starting point for the score of the new strong classifier stage. The benefit of Recycling Cascade is the reduction of the number of weak classifiers in the strong classifier stages and the reduction of the computation cost. A side effect of Recycling Cascade is that the last stage of the cascade can serve as an accurate strong classifier. Recycling & Retracting Cascade chooses a threshold after each weak classifier of the strong classifier produced by Recycling Cascade in order to reject some negative sub-windows. To set these thresholds, it evaluates each score on the set of positive examples and chooses the minimum score as the threshold so that all the positive examples in the set can pass all the weak classifiers.

These algorithms are evaluated on the standard MIT-CMU frontal face database [33], [8], which consists of 125 grayscale images containing 483 labeled frontal faces. If the detected rectangle and the ground-truth rectangle overlap by at least 50 percent, we call the detected rectangle a correct detection. The average number of features per window is used to represent the computation cost. Fig. 18 shows the computation cost of different algorithms (i.e., iCascade, Recycling Cascade and Recycling & Retracting Cascade) as a function of image location. The number of features used in a sliding window is accumulated at the center pixel of this sliding window. After detection, the value of each pixel is normalized to 0-255. The larger the value is, the greater the computation cost is, and the greater the probability that a face exists at that location. It can be observed that Fig. 18(d) (i.e., iCascade) is much darker and sparser than Fig. 18(b) and (c). The darkness and sparsity imply that iCascade consumes less computation than the other two algorithms.
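The 50 percent overlap criterion can be made concrete as follows. The text does not state which overlap measure is used, so this sketch assumes intersection over union of axis-aligned rectangles given as `(x0, y0, x1, y1)`; both function names are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two rectangles (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0                                   # no overlap
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def is_correct_detection(detected, truth, min_overlap=0.5):
    """A detection counts as correct if it overlaps the truth enough."""
    return overlap_ratio(detected, truth) >= min_overlap


print(is_correct_detection((0, 0, 10, 10), (0, 0, 10, 5)))
print(is_correct_detection((0, 0, 10, 10), (5, 5, 15, 15)))
```

Other definitions, such as the intersection divided by the ground-truth area, would give different counts near the 50 percent boundary.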

Fig. 19 shows the average number of features applied per window by the different methods at different expected detection rates $D_0$. For example, when the detection rate is 0.98, iCascade uses 5.95 features on average, whereas Fixed Cascade, Recycling Cascade and Recycling & Retracting Cascade use 22.84, 20.78 and 13.32, respectively. Fig. 20 shows the ROC curves of the different algorithms. There is no significant difference in the detection performance of the different methods. From Fig. 19 and Fig. 20, we can conclude that iCascade has a lower computation cost with no loss of detection performance.

































Fig. 18. The computation cost shown as a function of image location. (a) Original image. (b) Recycling response image. (c) Recycling & Retracting response image. (d) iCascade response image.

Fig. 19. Comparison of the computation cost between different algorithms

Fig. 20. ROC of different algorithms

VII. Conclusion

In this paper, we have proposed to design a one-stage cascade structure by partitioning a strong classifier into left and right parts. Moreover, we have proposed to design a multi-stage cascade structure by iteratively partitioning the right parts. Solid theories have been provided to guarantee the existence and uniqueness of the optimal partition point with the goal of minimizing the computation cost of the designed cascade classifier. The decreasing phenomenon has been discovered and theoretically justified for efficiently searching the optimal solutions. In addition, we have presented an effective algorithm for learning the optimal threshold of each stage classifier.

Appendix A: Proof of Theorem 12

Proof. Since $\hat r_2 = \arg\min_r f_2(r \mid \hat r_1)$ and $r_2 = \arg\min_r f_2(r \mid r_1)$,

$$\frac{\partial f_2(\hat r_2 \mid \hat r_1)}{\partial r_2} = \frac{\partial f_2(r_2 \mid r_1)}{\partial r_2} = 0$$

is true.

Now investigate the value of

$$f_2'(\hat r_2 \mid \hat r_1) - f_2'(r_2 \mid r_1) = [p_1'(\hat r_1)(\hat r_1 - \hat r_2 - c) - p_1'(r_1)(r_1 - r_2 - c)] + [p_1(\hat r_1) - p_1(r_1)]$$

if $\hat r_2 > r_2$ is true.

Since $\hat r_2 > r_2$ and $\hat r_1 < r_1$ are assumed, $0 < p_1'(r_1) < p_1'(\hat r_1)$, $\hat r_1 - \hat r_2 - c < r_1 - r_2 - c < 0$, and $0 < p_1(\hat r_1) < p_1(r_1)$. Therefore $p_1'(\hat r_1)(\hat r_1 - \hat r_2 - c) < p_1'(r_1)(r_1 - r_2 - c)$ and $p_1(\hat r_1) < p_1(r_1)$, so $f_2'(\hat r_2 \mid \hat r_1) - f_2'(r_2 \mid r_1) < 0$.

This contradicts $f_2'(\hat r_2 \mid \hat r_1) - f_2'(r_2 \mid r_1) = 0$. Therefore, $\hat r_2 > r_2$ is wrong and $\hat r_2 < r_2$ is true. □

Appendix B: Proof of Theorem 13

Proof. It is straightforward that $p_i(t_i)$ monotonically increases with $t_i$. Suppose we increase $t_k$ in stage $k$ from $t_k^s$ to $t_k^b$ with $t_k^s < t_k^b$ while the thresholds $t_i$ in the other stages (i.e., $i \ne k$) are fixed. Correspondingly, the rejection rate $p_k(t_k)$ grows from $p_k^s$ to $p_k^b$ and the computation cost $f_S$ changes from $f_S(t_k^s)$ to $f_S(t_k^b)$. Theorem 13 can be proved if $f_S(t_k^s) - f_S(t_k^b) > 0$.

Define $f_{k-1} = \sum_{i=1}^{k-1} (r_i + ic) \left[ \prod_{j=1}^{i} (1 - p_{j-1}) \right] p_i$; we have

$$f_S = f_{k-1} + \sum_{i=k}^{S} (r_i + ic) \left[ \prod_{j=1}^{i} (1 - p_{j-1}) \right] p_i + (T + (S + 1)c) \prod_{j=1}^{S+1} (1 - p_{j-1})$$

$$= \prod_{j=1}^{k} (1 - p_{j-1}) \left\{ (r_k + kc)\,p_k + \sum_{i=k+1}^{S} (r_i + ic) \left[ \prod_{j=k+1}^{i} (1 - p_{j-1}) \right] p_i + (T + (S + 1)c) \prod_{j=k+1}^{S+1} (1 - p_{j-1}) \right\} + f_{k-1}.$$

Because $f_{k-1}$ is independent of the threshold parameter in stage $k$, the difference between $f_S(t_k^s)$ and $f_S(t_k^b)$ is also independent of $f_{k-1}$. Therefore,


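$$
\begin{aligned}
f_S(t_k^s) - f_S(t_k^b) = \prod_{j=1}^{k} (1 - p_{j-1}) \Big\{ & (r_k + kc)(p_k^s - p_k^b) \\
& + (r_{k+1} + (k+1)c) \big[ (1 - p_k^s)\, p_{k+1}^s - (1 - p_k^b)\, p_{k+1}^b \big] \\
& + \cdots \\
& + (r_S + Sc) \Big[ p_S^s \prod_{j=k+1}^{S} (1 - p_{j-1}^s) - p_S^b \prod_{j=k+1}^{S} (1 - p_{j-1}^b) \Big] \\
& + (T + (S+1)c) \Big[ \prod_{j=k+1}^{S+1} (1 - p_{j-1}^s) - \prod_{j=k+1}^{S+1} (1 - p_{j-1}^b) \Big] \Big\}. \qquad (69)
\end{aligned}
$$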
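The algebra behind (69) can be checked numerically. The following is a minimal sketch, assuming the cost model used in this proof (a window rejected at stage $i$ costs $r_i + ic$, a window passing all $S$ stages costs $T + (S+1)c$, and $p_0 = 0$). The function names `cascade_cost` and `cost_difference` and the numbers in the test are hypothetical and chosen only for illustration.

```python
from math import prod, isclose

def cascade_cost(r, p, T, c):
    """Expected cost f_S for stage sizes r[0..S-1] and conditional
    rejection rates p[0..S-1] (p_0 = 0 is implicit)."""
    S = len(r)
    cost, survive = 0.0, 1.0  # survive = prod_{j=1}^{i} (1 - p_{j-1})
    for i in range(1, S + 1):
        cost += (r[i - 1] + i * c) * p[i - 1] * survive
        survive *= 1.0 - p[i - 1]
    # Windows that pass every stage are handled by the full strong classifier.
    return cost + (T + (S + 1) * c) * survive

def cost_difference(r, p_s, p_b, k, T, c):
    """Right-hand side of (69). p_s and p_b must agree on stages 1..k-1."""
    S = len(r)
    common = prod(1.0 - p_s[j] for j in range(k - 1))  # prod_{j=1}^{k} (1 - p_{j-1})

    def tail(p, i):
        # prod_{j=k+1}^{i} (1 - p_{j-1}); equals 1 when i == k
        return prod(1.0 - p[j - 1] for j in range(k, i))

    inner = 0.0
    for i in range(k, S + 1):
        inner += (r[i - 1] + i * c) * (p_s[i - 1] * tail(p_s, i) - p_b[i - 1] * tail(p_b, i))
    inner += (T + (S + 1) * c) * (tail(p_s, S + 1) - tail(p_b, S + 1))
    return common * inner
```

For any pair of rate vectors that agree before stage $k$, the direct difference `cascade_cost(r, p_s, T, c) - cascade_cost(r, p_b, T, c)` should match `cost_difference(r, p_s, p_b, k, T, c)` up to floating-point error.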


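Some terms of (69) can be evaluated by using the fact that iCascade can reject all the true negative sub-windows in the training data, so the total rejection rate $R$ is 1. The rejection rate $R$ consists of the total rejection rate $R_{k-1} = \sum_{i=1}^{k-1} \prod_{j=1}^{i} (1 - p_{j-1})\, p_i$ of the first $k-1$ stages and the total rejection rate $\bar{R}_{k-1} = \sum_{i=k}^{S} \prod_{j=1}^{i} (1 - p_{j-1})\, p_i + \prod_{j=1}^{S+1} (1 - p_{j-1})$ of stages $k, \ldots, S$. That is,

$$
\begin{aligned}
R &= \sum_{i=1}^{S} \prod_{j=1}^{i} (1 - p_{j-1})\, p_i + \prod_{j=1}^{S+1} (1 - p_{j-1}) \\
&= \sum_{i=1}^{k-1} \prod_{j=1}^{i} (1 - p_{j-1})\, p_i + \sum_{i=k}^{S} \prod_{j=1}^{i} (1 - p_{j-1})\, p_i + \prod_{j=1}^{S+1} (1 - p_{j-1}) \\
&= R_{k-1} + \bar{R}_{k-1} = 1.
\end{aligned}
$$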


Because the total rejection rate $R_{k-1}$ of the first $k-1$ stages does not change with $t_k$, the total rejection rate $\bar{R}_{k-1}$ of stages $k, \ldots, S$ is a constant $\eta$, no matter how $t_k$ varies. So if we denote the rejection rates $\bar{R}_{k-1}$ corresponding to $p_k^s$ and $p_k^b$ by $\bar{R}_{k-1}(t_k^s)$ and $\bar{R}_{k-1}(t_k^b)$, respectively, then $\bar{R}_{k-1}(t_k^s) = \bar{R}_{k-1}(t_k^b) = \eta$.
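The decomposition of the total rejection rate can be illustrated with a short numerical sketch. It assumes only the definitions above: `p[i-1]` holds the conditional rejection rate $p_i$ of stage $i$, $p_0 = 0$, and the windows surviving all $S$ stages are rejected by the final strong classifier. The function names are hypothetical.

```python
from math import isclose

def stage_fractions(p):
    """Return the fraction of negatives rejected at each stage 1..S and the
    fraction that survives all S stages."""
    fractions = []
    survive = 1.0  # prod_{j=1}^{i} (1 - p_{j-1}), with p_0 = 0
    for p_i in p:
        fractions.append(survive * p_i)
        survive *= 1.0 - p_i
    return fractions, survive  # survive = prod_{j=1}^{S+1} (1 - p_{j-1})

def split_rejection(p, k):
    """Split the total rejection rate into the part of stages 1..k-1 and the
    part of stages k..S (including the survivors of all stages)."""
    fractions, survive = stage_fractions(p)
    head = sum(fractions[:k - 1])
    tail = sum(fractions[k - 1:]) + survive
    return head, tail
```

Whatever rates are supplied, the two parts returned by `split_rejection` sum to 1, so fixing the first part (stages $1, \ldots, k-1$) fixes the second part as well.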


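Therefore, when $t_k$ grows from $t_k^s$ to $t_k^b$, $p_k$ will increase from $p_k^s$ to $p_k^b$: the rejection rate in stage $k$ will increase while the rejection rates in stages $k+1, \ldots, S$ will decrease or not change (i.e., $\prod_{j=1}^{i} (1 - p_{j-1}^s)\, p_i^s \geq \prod_{j=1}^{i} (1 - p_{j-1}^b)\, p_i^b$ and $\prod_{j=1}^{S+1} (1 - p_{j-1}^s) \geq \prod_{j=1}^{S+1} (1 - p_{j-1}^b)$). So we have

$$\prod_{j=k+1}^{i} (1 - p_{j-1}^s)\, p_i^s \geq \prod_{j=k+1}^{i} (1 - p_{j-1}^b)\, p_i^b, \qquad (70)$$

where $i = k+1, \ldots, S$.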

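Based on (70), $f_S(t_k^s) - f_S(t_k^b)$ in (69) satisfies:

$$
\begin{aligned}
f_S(t_k^s) - f_S(t_k^b) &> \prod_{j=1}^{k} (1 - p_{j-1}) \Big\{ r_k (p_k^s - p_k^b) + r_k \big[ (1 - p_k^s)\, p_{k+1}^s - (1 - p_k^b)\, p_{k+1}^b \big] + \cdots \\
&\qquad + r_k \Big[ p_S^s \prod_{j=k+1}^{S} (1 - p_{j-1}^s) - p_S^b \prod_{j=k+1}^{S} (1 - p_{j-1}^b) \Big] \\
&\qquad + r_k \Big[ \prod_{j=k+1}^{S+1} (1 - p_{j-1}^s) - \prod_{j=k+1}^{S+1} (1 - p_{j-1}^b) \Big] \Big\} \\
&= \prod_{j=1}^{k} (1 - p_{j-1})\, r_k \Big\{ \Big[ p_k^s + \sum_{i=k+1}^{S} p_i^s \prod_{j=k+1}^{i} (1 - p_{j-1}^s) + \prod_{j=k+1}^{S+1} (1 - p_{j-1}^s) \Big] \\
&\qquad - \Big[ p_k^b + \sum_{i=k+1}^{S} p_i^b \prod_{j=k+1}^{i} (1 - p_{j-1}^b) + \prod_{j=k+1}^{S+1} (1 - p_{j-1}^b) \Big] \Big\} \\
&= r_k \big\{ \bar{R}_{k-1}(t_k^s) - \bar{R}_{k-1}(t_k^b) \big\} = r_k \{ \eta - \eta \} = 0.
\end{aligned}
$$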
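The effect of raising a single stage threshold can also be observed on simulated data. The sketch below is an illustration under stated assumptions, not the experimental protocol of this paper: each negative window has one hypothetical score per stage, a window is rejected at the first stage whose score is below that stage's threshold, and the stage sizes satisfy $r_1 \leq r_2 \leq \cdots \leq r_S \leq T$. The names `average_cost` and `demo` and all numeric values are made up for the example.

```python
import random

def average_cost(scores, thresholds, r, T, c):
    """Mean cost over negative windows. scores[n][i] is the stage-(i+1)
    score of window n; rejection at stage i costs r_i + i*c, and passing
    all S stages costs T + (S+1)*c."""
    S = len(thresholds)
    total = 0.0
    for sample in scores:
        cost = T + (S + 1) * c  # default: passed every stage
        for i in range(S):
            if sample[i] < thresholds[i]:
                cost = r[i] + (i + 1) * c  # rejected at stage i+1
                break
        total += cost
    return total / len(scores)

def demo(seed=0):
    """Compare the cost before and after raising the threshold of stage 2."""
    rng = random.Random(seed)
    S, T, c = 4, 200, 1.0
    r = [5, 20, 60, 120]  # assumed non-decreasing stage sizes
    scores = [[rng.random() for _ in range(S)] for _ in range(2000)]
    t_small = [0.3, 0.3, 0.3, 0.3]
    t_big = list(t_small)
    t_big[1] = 0.6  # only the threshold of stage k = 2 is raised
    return (average_cost(scores, t_small, r, T, c),
            average_cost(scores, t_big, r, T, c))
```

Under these assumptions every window that is newly rejected at stage $k$ costs $r_k + kc$, which is no more than what it cost before, so the second value returned by `demo` cannot exceed the first.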
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